1.  Introduction


1.1  Motivation 

The  U.S.  Army  has  identified  a  critical  need  to  fuse  information  horizontally  across  multiple  peer 
units  and  vertically  up  and  down  the  chain  of  command  to  create  real-time  and  near-real-time 
knowledge  at  all  levels  of  U.S.  Army  operations.  The  concept  of  fusing  information  from  a  wide 
variety  of  battlefield  assets  has  been  evolving  for  the  last  20  years  to  improve  command,  control, 
communications,  computers,  intelligence,  surveillance,  and  reconnaissance  (C4ISR)  capabilities 
for  tactical  wartime  and  contingency  operations,  situational  assessment  and  awareness,  and 
real-time collaboration. Most recently, the science and technology (S&T) initiatives identified to
enhance the U.S. Army’s ability to accomplish its emerging Homeland Security mission have
underscored the need to fuse multisource information to provide a Common Relevant
Operational Picture built on shared data.

We  use  the  term  “knowledge  fusion”  to  emphasize  the  real-time  integration  and  use  of 
information  from  heterogeneous  civilian  and  military  sources  to  assist  the  human  process  of 
knowledge  creation.  This  concept  builds  on  the  emerging  field  of  multisensor  data  fusion 
technology.  Multisensor  data  fusion  focuses  on  the  integration  of  data  from  multiple  sensors  to 
achieve  more  accuracy  than  could  be  achieved  by  using  a  single,  independent  sensor  (Hall  and 
Llinas,  2001).  The  Joint  Directors  of  Laboratories  (JDL)  Data  Fusion  Working  Group  was 
established  in  1986  to  improve  communications  among  researchers  and  developers.  The 
two-layer,  hierarchical  JDL  process  model  that  resulted  from  its  efforts  identifies  the  processes, 
functions,  and  techniques  applicable  to  data  fusion.  The  1998  model  revision  broadly  defines 
data  fusion  as  “the  process  of  combining  data  or  information  to  estimate  or  predict  entity  states” 
(Steinberg et al., 1998). Excellent overviews of multisensor data fusion are provided by
Hall  and  Llinas  (2001),  Waltz  and  Llinas  (1990),  and  Antony  (1995). 
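
The claim that combining sensors can beat a single sensor can be illustrated with a minimal sketch. It assumes independent, unbiased sensors whose error variances are known, and it uses inverse-variance weighting, which is one standard combination rule and not necessarily the method of any work cited above. The function name `fuse_estimates` and the sample values are hypothetical.

```python
def fuse_estimates(readings):
    """Fuse (value, variance) readings by inverse-variance weighting.

    Assumes the readings are independent and unbiased, and that each
    variance is known and strictly positive.
    """
    if not readings:
        raise ValueError("at least one reading is required")
    # A more precise sensor (smaller variance) receives a larger weight.
    weights = [1.0 / variance for _, variance in readings]
    total_weight = sum(weights)
    fused_value = sum(w * value for w, (value, _) in zip(weights, readings)) / total_weight
    # Under the stated assumptions, the fused variance is never larger
    # than the smallest input variance.
    fused_variance = 1.0 / total_weight
    return fused_value, fused_variance


# Two hypothetical range sensors reporting the same target distance.
value, variance = fuse_estimates([(10.0, 4.0), (12.0, 4.0)])
```

With two equally noisy sensors, the fused estimate is their average and its variance is half that of either sensor alone.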

Knowledge fusion also builds on the emerging fields of multiagent systems (Subrahmanian
et al., 2000; Genesereth and Fikes, 1992), information integration (Wiederhold, 1992, 2002;
Jhingran et al., 2002), and the semantic web (Fensel, 2001; Fensel et al., 2003a). Knowledge
fusion focuses on the integration and system techniques and the principles that increase the
value of information produced by integrating information from multiple real-time and
non-real-time heterogeneous sources that include, but are not limited to, sensors.

In  the  battlespace,  information  is  continuously  arriving  from  sensors  of  various  types  on  different 
platforms. The output of these sensors, along with verbal reports, electronic mail and reports,
background knowledge, breaking web-based news reports, and other types of information,
needs to be fused into a coherent picture of the immediate situation so that the best courses of
action may be determined. In practice, the terms “knowledge fusion” and “multisensor fusion”
are both commonly used to refer to advanced technologies for integrating data from multiple
sources  for  rapid  and  effective  decision  making.  However,  areas  of  focus  and  techniques  for 
solving  the  problem  tend  to  stem  from  different  traditional  and  emerging  disciplines. 

Knowledge  fusion  information  sources  may  store  data  in  various  formats  within  databases,  files, 
or  on  web  pages.  Alternatively,  the  data  may  be  the  real-time  streaming  output  of  various 
devices. Furthermore, the structuring or categorization of the information may be radically
different in the various sources. Additionally, different types of information change over time
at different rates: some data are relatively static, while other data change continuously.
The goal of work in knowledge fusion is to develop techniques to
intelligently  fuse  and  process  massive  amounts  of  heterogeneous  information  from  a  wide  range 
of  distributed  sources  so  that  it  may  be  used  for  decision  making  and  problem  solving. 

Knowledge  is  emerging  as  the  ultimate  in  force  multipliers.  As  information  technology  advances 
in  capabilities  and  security,  the  symbiosis  between  machine  and  human  increases  in  an 
unprecedented  way.  A  core  capability  in  the  concept  of  knowledge  fusion  is  support  of  human 
knowledge  processes,  accomplished  through  simplification  and  filtering  techniques,  and 
multimedia  presentation  techniques  that  can  be  tailored  to  the  preferences  and  skills  of  the 
human  receiver.  Technologies  to  support  these  capabilities  in  military  domains  differ  from  those 
in  civilian  domains  because  systems  on  the  battlefield  and  in  recovery  and  consequence 
management  (R&CM)  situations  must  accommodate  a  real-time,  on-the-move,  dynamic,  and 
extremely  high-stress  environment  with  common  operations  performed  by  humans  with  a  wide 
range  of  skills  and  tasks. 

1.2  Discussion  of  Terminology 

The  words  “data,”  “information,”  and  “knowledge”  are  used  in  varying  and  overlapping  ways 
not only in everyday discourse but also in the scientific literature. The rapidly evolving
area of data fusion is primarily concerned with mathematical techniques to combine the
output of sensors. The most mature area of multisensor data fusion is aimed at combining sensor
data  to  determine  the  position,  velocity,  attributes,  and  identity  of  individual  objects  to  identify 
and  maneuver  targets.  Within  the  computer  science  field  of  databases,  the  well-established  area 
of  data  and  database  integration  addresses  combining  data  sources.  This  field  studies  techniques 
to  combine  information  from  databases  with  different  schemas,  but  also  has  been  extended  to 
incorporate  information  from  less-structured  sources  such  as  files  and  web  pages.  The  area  of 
knowledge  fusion  includes  both  of  these  areas,  but  also  defines  and  addresses  processes  that 
increase the value of information achieved by combining and relating information from a wide
variety  of  sources.  We  do  not  expect  to  combine  materialized  data  sources,  but  rather  to  bring 
together  only  selected  results  derived  from  these  sources.  We  focus  on  the  human  problem  of 
data  overload  caused  by  the  current  explosion  of  battlespace  data  from  local  and  remote  sensors, 
human  voice  and  email  input,  web-based  data,  and  multimedia  news  data.  The  knowledge  fusion 
problem is finding and assembling the information pertinent to making informed
decisions and taking swift action when confronted with a mountain of data.

There  is  some  philosophically  oriented  literature  on  the  relationship  between  the  notions  of  data, 
information, and knowledge (Dretske, 1999; Devlin, 1991). Although it is not our intention to
prescribe  the  way  these  terms  should  be  used,  a  mention  of  this  discussion  does  lend  some  clarity 
to  the  varying  approaches  to  be  discussed  in  this  report,  particularly  since  the  philosophical 
discussion  does  at  least,  in  a  very  loose  sense,  correspond  to  the  way  these  terms  are  generally 
used.  In  ordinary  usage,  there  is  certainly  a  continuum  from  data  to  information  and  then  to 
knowledge.  The  notion  of  data  is  the  most  primitive  (e.g.,  raw  data).  The  notion  of  information 
implies  more  interpretation  and  processing.  Finally,  knowledge  carries  the  implication  of 
general  principles  or  facts  of  a  propositional  form,  which  are  known  to  be  true. 

Devlin (1999), drawing upon the work of Dretske (1999), situation semantics (Devlin, 1991),
and also business researchers such as Davenport and Prusak (1998), argues for the utility of
distinguishing between the terms in the following manner. Information is data along with
meaning, and knowledge is information that has been internalized and associated with the
ability to use the information. Knowledge is then an individual notion as it exists within a
person.  Similar  views  are  expressed  by  Brown  and  Duguid  (2000). 

The approach of situation semantics aims to establish a Mathematics of Information
(Devlin, 1991, 1999; Barwise, 1989) by further refining the view of information as data along
with  meaning.  The  core  idea  is  that  information  is  made  up  of  some  form  of  representation  and  a 
procedure  for  encoding  and  decoding  the  information.  The  encoding/decoding  procedure  is 
based  upon  constraints  between  the  representation  and  the  information  conveyed  about  the 
world.  An  important  outcome  of  this  approach  is  the  idea  that  information  can  only  be 
understood  or  decoded  if  contextual  situations  are  identified  (Devlin,  1999).  The  issue  of 
representing and maintaining the context (e.g., temporal, geographical, terminological) in
which data/information are gathered, and utilizing that contextual information to combine
various sources into a unified picture, is one that in various forms underlies much of the work
discussed  in  this  report. 

1.3  Outline  of  Report 

This  report  identifies  and  analyzes  technologies  relevant  to  U.S.  Army  knowledge  fusion  from  a 
variety of fields such as multisensor data fusion, intelligent agents, semantic web, information
integration,  databases,  artificial  intelligence,  systems  engineering,  and  others.  The  goal  is  to 
identify  and  evaluate  techniques  that  have  high-potential  applicability  to  Army  knowledge  fusion 
problems. 

In section 2, we describe the U.S. Army’s Future Force, the vision of the army of the future. The
aim  of  knowledge  fusion  is  to  integrate  critical  information  to  enable  knowledgeable  action 
required  by  the  Army’s  Future  Force.  Section  3  describes  some  examples  of  Army  knowledge 
fusion issues. Technologies relevant to knowledge fusion are identified and discussed in
section  4.  Finally,  technologies  of  importance  to  the  Army  knowledge  fusion  problem  are 
analyzed  and  summarized  in  section  5. 


2.  Background 


The  U.S.  Army’s  Future  Force  (Department  of  the  Army,  2004;  Williams,  2003;  U.S.  Army 
Training  and  Doctrine  Command,  2002),  the  army  of  the  21st  century,  is  being  designed  to  be 
joint,  interagency,  and  multinational  (JIM).  Land  power  plays  a  critical  role  in  dominating  the 
highly  complex  land  environment  that  comprises  the  heart  of  most  joint  operations.  There  is  an 
enduring  and  unavoidable  challenge  to  control  terrain,  people,  and  resources  on  land,  where 
political  authority  resides.  The  Future  Force  is  to  be  an  instrument  that  is  precise,  maneuverable, 
and  flexible.  It  must  be  able  to  fight  on  a  battlefield  that  is  multidimensional,  dispersed, 
continuous  and  noncontiguous  in  nature. 

The  units  to  be  deployed  must  be  tailored  to  the  operation  at  hand.  The  units  will  incorporate  the 
appropriate  equipment  and  personnel  for  the  operation  drawn  from  multiple  U.S.  agencies  as 
well as those of other countries. The operation of the units is network-centric in that
commanders  in  different  places  can  obtain  relevant  information. 

A  crucial  role  is  to  be  played  by  the  distributed  information  environment  called  the  Global 
Information  Grid  (GIG),  which  is  to  provide  links  from  the  “factory  to  foxhole”  and  “space  to 
mud.”  The  architecture  of  GIG  needs  to  be  designed  to  create  a  knowledge-based  force  in  which 
soldiers receive and send the right information at the right place and time. This information
environment  is  to  facilitate  the  transformation  of  data  into  knowledge. 

In  this  section,  the  demands  of  the  battlespace,  sustaining  the  base,  and  also  homeland  security 
are  discussed  in  the  context  of  the  Army’s  Future  Force  and  GIG. 

2.1  Battlespace 

The  Army’s  Future  Force,  supported  by  GIG,  will  carry  out  Network  Centric  Warfare  (NCW) 
which  is  defined  as 

an information superiority-enabled concept of operations that generates
increased  combat  power  by  networking  sensors,  decision  makers,  and  shooters 
to  achieve  shared  awareness,  increased  speed  of  command,  a  higher  tempo  of 
operations,  greater  lethality,  increased  survivability,  and  a  degree  of 
self-synchronization  (Alberts  et  al.,  2002). 

Through  the  linking  of  knowledgeable  entities  in  the  battlespace,  NCW  is  to  convert  information 
superiority  into  combat  superiority.  The  backbone  of  this  information  system  is  an  integrated 
communications system. This includes both wireless and wired networking. The network
connects  all  battlespace  entities — sensors,  actors,  and  decision  makers. 

In  the  network-centric  model,  sensors  and  actors  are  decoupled,  unlike  the  traditional 
platform-centric  model.  Sensors  from  one  platform  may  provide  information  that  enters  the 
network  and  is  ultimately  used  by  a  variety  of  other  actors  and  decision  makers.  Additionally, 
new  types  of  sensors  are  available  and  need  to  be  effectively  used. 

Sensors  designed  to  sense  new  things  and  maneuver  in  close  to  make  distinctions 
among  things  we  cannot  now  distinguish — The  net  result  of  all  of  these  changes 
will  be  the  proliferation  of  lower  cost,  independent  sensors  and  actors  that  will 
contribute  to  and  depend  more  upon  distributed  rather  than  embedded 
intelligence (Alberts et al., 2002).

Information  collected  by  the  sensors  needs  to  be  put  in  a  form  so  that  it  can  be  used  by  other 
battlespace  entities.  The  other  entities  need  to  fuse  the  information,  interpret  it  in  context  and 
understand the implications (Alberts et al., 2002). Note that the knowledge/information fusion
issue arises immediately here because sensor information from a sensor designed for use by a
particular battlespace entity is being made available for wider use. Additionally, we also see the
importance of movement or positioning for the purpose of information gathering.

The ultimate goal of this fusing is the construction of a common relevant operational picture
(CROP)  (U.S.  Joint  Forces  Command,  2004;  Falvo,  2004).  The  CROP  will  provide  a 
presentation  of  timely,  fused,  accurate,  and  relevant  information  that  can  be  tailored  to  meet  the 
requirements  of  the  Joint  force  commander  and  the  Joint  force,  and  is  common  to  every 
organization  and  individual  involved  in  a  joint  operation.  A  CROP  facilitates  collaborative 
planning  and  assists  all  echelons  to  achieve  situational  awareness. 

Providing  battlespace  awareness  to  war  fighters  across  the  Joint  force  with 
requisite accuracy and timeliness requires that data and information from multiple
sources be collected, processed (analyzed when necessary), transported, fused,
placed in appropriate contexts, and presented in ways that facilitate rapid and
accurate inferences. It also requires that actors and decision entities be provided
by training with internal models and/or decision aids or models. With this insight,
we can observe that it requires both battlespace awareness and these cognitive
models to generate battlespace knowledge which is in and of itself, an emergent
network-centric property (Alberts et al., 2002).

The integrated information found in the CROP includes (Alberts et al., 2002) the following:

1.  The  location  of  the  various  friendly  forces,  enemy  sources,  and  other  entities. 

2.  The  available  courses  of  action  for  both  friendly  forces  and  enemy  forces. 

3.  The  predicted  future  actions  and  positions  for  all  forces. 

4.  The status and characteristics of various units.

5.  The  environment,  including  the  terrain  and  weather  conditions. 

Battlespace  awareness  results  from  the  fusion  of  key  types  of  information  that  describe  the 
battlespace. These include positions of forces and geography. Battlespace knowledge, by
contrast, consists of tacit information that requires interpretation. Examples are the intent,
capabilities, and tactics of the various forces. The information used to build the CROP will come from a
variety  of  sources.  It  will  not  only  include  the  battlespace  sensors  but  also,  for  example,  the 
information  available  on  news  networks  such  as  CNN. 

Yet  another  aspect  of  NCW  is  to  capitalize  upon  the  fact  that  the  actor  entities  are 
knowledgeable.  Command  and  control  (C2)  may  be  replaced  by  command  and  coordination. 
Thus  the  desired  results  may  be  achieved  without  the  issuance  of  top-down  detailed  instructions. 

That  is  to  say  that  organizational  behavior  could  be  consciously  designed  to  be  an 
emergent  property  that  derives  from  the  commander’s  intent,  as  internalized  by 
actor  entities,  the  degree  of  battlespace  knowledge  available  and  the  ability  of 
decision  entities  to  minimize  the  constraints  imposed  on  actor  entities  by  virtue  of 
the  resources  allocated  to  actor  entities.  It  is  hard  to  overestimate  the  impact  that 
this  new  dimension  of  command  and  control  will  have  on  the  way  we  will 
approach operations in the future (Alberts et al., 2002).

How  to  facilitate  and  control  the  execution  of  such  a  loose  and  distributed  plan  or  task  is  an  open 
question. 

Additionally,  it  is  expected  that  use  will  be  made  of  battlespace  agents,  which  perform  selected 
tasks  as  delegated  by  the  decision  and  actor  entities.  We  not  only  have  links  between  sensors, 
actors,  and  decision  entities,  but  battlespace  agents  also  play  a  role.  These  agents  are  automated 
decision or information processes (Alberts et al., 2002).

Some  of  the  tasks  that  may  be  given  to  battlespace  agents  are  as  follows: 

1.  Request additional information that is required by the situation.

2.  Task  the  sensors. 

3.  Notify  the  decision  entities  that  something  requires  immediate  attention. 

4.  Translate  a  commander’s  intent  into  instructions. 

5.  Resolve  inconsistencies  within  the  CROP. 

In  summary,  the  vision  of  the  future  U.S.  Army  reveals  the  crucial  importance  of  knowledge  and 
information  fusion  in  the  information  environment  needed  to  support  the  Joint  operational 
activities.  Massive  amounts  of  continuously  arriving  information  from  a  variety  of  sensors  and 
information  sources,  in  a  variety  of  formats  and  from  various  locations  and  perspectives,  must  be 
combined into a unified picture that is able to discern and present only that relevant set of
information  that  the  warrior  needs  to  make  informed  decisions  and  take  appropriate  action.  The 
fact  that  the  CROP  is  available  to  a  variety  of  battlespace  agents  produces  the  potential  for  a  new 
form  of  C2  through  coordination  rather  than  exclusively  and  centrally  directed  control. 
Automated  agents  and  robots  are  to  be  well  integrated  into  this  networked  system  in  a  way  that 
results  in  a  symbiotic  relationship  between  warrior  and  machine. 

2.2  Sustaining  Base 

Some  issues  found  in  the  commercial  sector  also  arise  in  the  Army’s  sustaining  base  domain,  but 
with  significantly  different  constraints  and  requirements  due  to  the  fact  that  the  sustaining  base 
supports a rapidly changing battlespace with high human risk factors.

Inventory  is  stored  at  various  locations,  and  as  it  is  used,  orders  need  to  be  made  to  replenish  the 
supply.  Information  on  the  nature  of  the  items  stored  in  various  locations  is  kept  in  different 
formats on various types of databases. While as-needed work is being carried out, information
systems  need  to  ensure  that  all  the  required  steps  (e.g.,  maintenance  operations)  are  performed. 

One  major  difference  between  the  military  case  and  the  commercial  sector  is  the  interaction  with 
the  battlespace.  For  example,  equipment  must  be  stored  at  appropriate  locations  in  anticipation 
of  events  in  the  battlespace,  and  events  in  the  battlespace  will  initiate  supply  efforts  on  the 
sustaining  base.  The  events  of  the  battlespace  are  more  difficult  to  predict  than  those  that  occur 
in  the  private  sector.  It  may,  for  example,  be  necessary  to  move  large  numbers  of  soldiers  from 
one  location  to  another  with  very  short  notice  and  have  all  the  necessary  arrangements 
(e.g.,  food,  housing,  equipment,  transportation)  made. 

Furthermore,  the  fact  that  the  Army’s  Future  Force  is  to  draw  equipment  and  personnel  for 
operations  from  multiple  U.S.  agencies  as  well  as  those  from  other  countries  means  that  the 
information integration task will be immense. It is to be expected that the terminology and
categories  used  to  describe  available  equipment  and  personnel  will  be  radically  different  between 
the  Army  and  the  various  U.S.  and  foreign  agencies  with  which  efforts  need  to  be  coordinated. 
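
The terminology mismatch described above can be illustrated with a small sketch. It assumes a hand-built mapping from each source's own category labels to a shared vocabulary. The source names, labels, counts, and the `normalize_records` function are all hypothetical and are not drawn from any Army or coalition system.

```python
# Hypothetical mapping from (source, local label) to a shared category.
SHARED_VOCABULARY = {
    ("army", "truck, cargo"): "cargo_vehicle",
    ("partner_a", "lorry"): "cargo_vehicle",
    ("agency_b", "field medic"): "medical_personnel",
    ("army", "combat medic"): "medical_personnel",
}


def normalize_records(records):
    """Split records into (mapped totals, unmapped records).

    Each record is a dict with 'source', 'label', and 'count' keys.
    Records with no known mapping are returned separately so that a
    person can review them instead of having them silently dropped.
    """
    mapped, unmapped = {}, []
    for record in records:
        key = (record["source"], record["label"].strip().lower())
        category = SHARED_VOCABULARY.get(key)
        if category is None:
            unmapped.append(record)
        else:
            mapped[category] = mapped.get(category, 0) + record["count"]
    return mapped, unmapped


inventory = [
    {"source": "army", "label": "Truck, Cargo", "count": 4},
    {"source": "partner_a", "label": "lorry", "count": 3},
    {"source": "partner_a", "label": "bowser", "count": 1},
]
totals, needs_review = normalize_records(inventory)
```

A fixed lookup table only works while the vocabularies are small and stable. The point of the sketch is that every unmapped label becomes an explicit review item, which is where the integration effort concentrates as the number of partners grows.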

Finally,  delay  and  error  can  result  in  loss  of  life.  Information  concerning  the  status  of  sustaining 
base operations can assist the enemy, with a potential negative impact on the operation that again
includes loss of life.

In  summary,  some  sustaining  base  knowledge  fusion  tasks  have  similarities  with  those  found  in 
more  traditional  information  integration  tasks  that  arise  in  the  commercial  sector.  But  the 
military  domain  augments  the  challenges  of  these  fusion  tasks  considerably  and  adds  other 
critical  dimensions  because  of  the  rapidly  changing  nature  of  the  battlespace,  the  high  need  for 
security,  and  the  significant  impact  of  delay  and  error. 


2.3  Homeland  Security 


As is now well known, prior to 9/11/01, there was information available that could
possibly have alerted us to the impending attacks. The question arises as to whether an advanced
information  system  could  have  put  all  the  pieces  together,  predicted  the  attack,  and  allowed 
countermeasures  to  be  taken. 

There is an immense amount of data available, some in public sources and some
explicitly gathered by different governmental agencies. The number of possible types of threats
is immense. Additionally, these can occur at many different locations and at any time. Issues of
knowledge and information fusion arise in putting the different pieces together to get a unified
picture  of  where  the  likely  threats  are. 

Again,  one  sees  the  traditional  information  integration  issue  here.  But  the  amount  of  data  is  vast 
and  continuously  changing.  There  is  really  a  knowledge  fusion  problem  beyond  that  of 
information integration. The question is how the crucial pieces of information can be identified
(from among the vast amounts of unimportant information) and put together so that action can be
taken.

2.4  Some  Issues 

Certainly, traditional information integration issues concerning how to combine information
from heterogeneous structured (e.g., databases) and unstructured (e.g., files and web pages)
information sources are of importance in addressing the knowledge fusion issues that are to arise
in the U.S. Army of the future. Given the nature of the Army’s Future Force, i.e., the utilization
of  carefully  tailored  units  with  membership  drawn  not  only  from  the  U.S.  Army,  but  also  from 
foreign  military  forces  as  well  as  various  U.S.  Governmental  agencies,  there  will  be  a 
particularly  massive  information  integration  problem. 

Given the nature of NCW, the data arriving from all available sensors will be massive in volume
and will arrive continuously. The techniques of data fusion are certainly needed. Given the
number of sensors available, one issue is how to determine automatically when a change in
the continuously streaming data is significant. Additionally, in NCW, the sensors used by
a particular actor are not at the same position as the actor. The information from the various
sensors needs to be combined to give the knowledge/information needed by the actor to carry out
its task(s) and achieve its goals. Note that sensors may be moved to particular locations
(independently of the actors) in order to obtain information that would satisfy particular
knowledge goals. These knowledge goals are necessary to satisfy the overall battlespace goals.
It is necessary to represent knowledge goals and develop techniques for obtaining plans to
achieve them.
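
One way to make the question of when a change in streaming data is significant concrete is a cumulative-sum (CUSUM) detector. The sketch below assumes a single numeric stream with a known expected level. The technique is a standard one from statistical process control and is offered only as an illustration, not as an approach proposed in this report. The function name and the parameter values are hypothetical.

```python
def cusum_alarms(samples, target, slack, threshold):
    """Return the indices at which a two-sided CUSUM statistic raises an alarm.

    target:    expected level of the stream when nothing has changed
    slack:     deviation from target that is tolerated as ordinary noise
    threshold: accumulated excess deviation that counts as significant
    """
    high = low = 0.0
    alarms = []
    for index, sample in enumerate(samples):
        # Accumulate only the part of each deviation that exceeds the slack.
        high = max(0.0, high + (sample - target - slack))
        low = max(0.0, low + (target - sample - slack))
        if high > threshold or low > threshold:
            alarms.append(index)
            high = low = 0.0  # restart accumulation after an alarm
    return alarms


# Hypothetical stream: steady at 0.0, then shifting to 2.0.
stream = [0.0] * 5 + [2.0] * 5
alarm_indices = cusum_alarms(stream, target=0.0, slack=0.5, threshold=3.0)
```

The slack and threshold trade detection delay against false alarms. In a setting with many sensors, they would have to be chosen per stream, which is part of the open issue raised above.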

There  is  an  added  issue  of  the  representation  of  plans.  When  a  crucial  piece  of  information 
arrives,  indicating  that  a  particular  event  has  occurred,  how  is  it  possible  to  identify  the  plans  or 
ongoing  activities  that  are  affected  by  this  event?  What  changes  then  need  to  be  made? 
Additionally, it is expected that knowledgeable actors can self-synchronize rather than being
directed  from  the  top.  How  is  it  possible  to  represent  the  ongoing  plan  or  activity  in  such  a  way 
that  this  self-synchronization  is  possible? 

There  may  be  a  wide  variety  of  automated  tools  or  agents  available  on  the  network.  Somehow, 
these  need  to  be  categorized  and  set  up  in  such  a  way  that  the  particular  services  that  they 
provide  can  be  determined  either  automatically  or  by  a  human  user.  Techniques  for  doing  so 
need  to  be  identified  and/or  developed. 

A  number  of  areas  of  computer  science  that  are  certainly  relevant  to  the  Army’s  Future  Force 
and  NCW  have  not  been  included.  These  are  as  follows: 

1.  Issues related to human-computer interfaces and human factors relevant for the design of
CROP. 

2.  Computer/network  security  issues. 

These  issues  are  certainly  important,  but  the  literature  related  to  these  issues  is  vast  and  outside 
the  scope  of  this  report. 


3.  Motivating  Examples 


It  is  useful  to  have  some  detailed  examples  of  knowledge  fusion  issues  that  might  arise  in  the 
operations of the U.S. Army to facilitate work on this topic. This section describes several
places  in  the  literature  where  the  beginnings  of  such  scenarios  are  found. 

3.1  Movement Analysis Battlespace Challenge Problem

The Defense Advanced Research Projects Agency (DARPA) High Performance Knowledge
Bases (HPKB) Project used a number of problem types and specific scenarios
(Cohen et al., 1998). One of these was the Movement Analysis Battlespace Challenge Problem.

The  goal  of  this  problem  was  to  understand  sensor  data  pertaining  to  the  movement  of  vehicles. 
The  sensor  data  were  an  idealized  version  of  that  produced  by  airborne  Joint  Surveillance  Target 
Attack  Radar  System  (JSTARS).  This  was  supplemented  with  some  intelligence  reports 
concerning  when  enemy  radar  has  been  turned  on,  information  on  enemy  communications  and 
human  intelligence.  Information  on  the  geography,  roads,  and  the  composition  of  enemy  forces 
in  the  region  was  also  given. 

The  task  of  the  movement  analysis  involved  the  following: 

1.  Distinguish military from nonmilitary traffic.

2.  Identify the sites between which the military traffic is moving and determine their types
(e.g., battle position, support area, artillery site) and significance.

3.  Identify  which  units  are  participating  in  each  movement  (convoy). 

4.  Determine the purpose of the convoy (e.g., reconnaissance or engaging in battle).

5.  Determine the exact types of the vehicles.

See  Cohen  et  al.  (1998)  for  more  details. 

Solutions to this type of scenario would directly serve the needs of the Army’s Future Force.
The construction of the CROP in network-centric warfare calls for this kind of interpretation of
enemy actions.

3.2  Workaround  Battlespace  Challenge  Problem 

The purpose of this problem, also from the DARPA HPKB Project, was to devise a system to help
determine targets for destruction by determining how rapidly the enemy can bypass
(workaround) the damage to the target (Cohen et al., 1998). Although the researchers in the
HPKB  project  had  primarily  air  strikes  in  mind,  the  same  sort  of  example  can  be  used  to  reason 
about  the  choice  of  targets  for  destruction  or  capture  by  ground  forces. 

Input  to  the  task  includes  the  following: 

1.  Description of the target, the damage to it, and key features of the geography of the area.

2.  The  purpose  of  destroying  the  target,  e.g.,  to  prevent  a  particular  enemy  unit  from  moving 
to  a  particular  position.  Additionally,  the  input  includes  the  time  period  during  which  the 
enemy  capability  needs  to  be  interdicted. 

3.  Description  of  the  enemy  resources. 

The workaround generator produces a reconstitution schedule, giving the capacity
of  the  enemy  to  either  repair  the  target  or  to  find  another  way  (workaround)  of  accomplishing  the 
activity  mentioned  in  point  2.  The  workaround  generator  needed  substantial  information  about 
the  engineering  capabilities  of  the  enemy.  Again,  solutions  to  this  type  of  scenario  are  of  direct 
relevance  to  the  problem  of  supporting  the  Army’s  Future  Force.  Kingston  (2001)  contrasts 
several  approaches  to  this  problem. 

3.3  Joint  Warrior  Interoperability  Demonstration  2003 

Another  example  is  the  Joint  Warrior  Interoperability  Demonstration  2003  scenario  (Office  of 
the  Army,  2004).  The  scenario  is  set  in  an  area  of  operations  deep  in  the  Pacific  Ocean,  centered 
on  a  fictitious  island  called  Tindoro.  Tindoro  is  divided  due  to  a  turbulent  political  history.  The 
North,  the  opposition  force,  has  created  a  province  that  is  politically  and  economically  attached 
to  the  mainland.  The  South  is  an  independent  nation  with  friendly  relations  with  a  nation  called 
Rabenneste. It controls the Fingal Enclave located within the North territory. As the action
begins,  the  region  has  been  destabilized  to  the  point  of  needing  United  Nations  (UN) 
intervention.  A  conflict  occurs  for  a  period  of  220  days.  The  UN  inserts  substantial  force  into 
the  South,  and  captures  Flinders  Island.  The  North  attacks  Fingal  Enclave.  The  UN 
Multinational  Task  Force  (MTF)  recaptures  the  Enclave.  Peace  Keeping  Operation  forces  enter 
the  South  and  provide  humanitarian  aid.  MTF  consolidates  forces  in  the  South  and  enforces 
peace keeping agreements. This scenario requires 43 coalition partners in
multiple, widespread geographical locations to share relevant operational information reliably
and  quickly.  These  nations  are  not  necessarily  linked  by  treaty.  MTF  staff  officers  must  have 
secure  and  reliable  access  to  near  real-time  situational  awareness. 

3.4  Homeland  Security  Problem 

A good example for homeland security is exactly what happened prior to 9/11/01. The recently
released Congressional Report contains much material. The question is what type of information
system would have put things together and issued warnings. The data would include the Phoenix
Report from July of 2001, a memo from an FBI agent in the Phoenix field office indicating that
attention should be paid to individuals from the Middle East who have enrolled in U.S. flight
schools. Additional information would be money transfers such as that from Ramzi Bin al-Shibh
(who is assumed to be a member of a German al Qaeda cell) to Zacarias Moussaoui, who was
enrolled in a flight school in Minnesota. This type of data and the various sources of
information,  where  it  was  available,  could  be  developed  into  a  testbed  problem  for  knowledge 
fusion  efforts. 


4.  Technologies 


Technologies relevant to constructing the information system needed to support Army
knowledge fusion are quite broad, covering many areas from multiple disciplines, such as
Sensors, Databases, Artificial Intelligence, and Systems Engineering. We have divided up the
literature into manageable units. To some extent, the determination of units is arbitrary and
others may prefer a different division. Nevertheless, we feel that the categorization chosen
here helps in understanding the wide variety of techniques which are available to address the
knowledge fusion problem. Sections 4.1 and 4.2 contain the relatively well-established areas of
Data Fusion and Information Integration. The areas of Ontologies and the Semantic
Web are grouped together in section 4.3. Section 4.4 contains the somewhat disparate set of
topics grouped together under the label Advanced Representation and Reasoning. These include
planning, plan recognition, logic programming, Bayesian networks, and real-time problem
solving. A number of topics related to databases did not properly fit into any of the earlier
sections. These include research on querying streams, querying unstructured information
sources, web mining, and data mining. They are in section 4.5, labeled Contemporary Database
Topics. Finally, in section 4.6, topics related to the design of the overall architecture of the
information system to support the Army’s Future Force are covered.

4.1 Data Fusion

Data  fusion  is  a  name  for  work  carried  out  primarily  with  military  applications  in  mind.  The 
typical  problems  are  object  identification,  combining  information  from  multiple  sensors,  object 
tracking,  threat  assessment,  and  so  on.  The  methods  utilized  by  work  in  this  area  are  generally 
numerical  and  mathematical  techniques  based  on  linear  algebra,  probability  theory,  stochastic 
processes,  and  statistics.  The  overall  problem  has  been  defined  as  follows: 

Locate  and  identify  an  unknown  number  of  unknown  objects  of  many  different 
types  on  the  basis  of  different  kinds  of  evidence.  This  evidence  is  collected  on  an 
ongoing  basis  by  many  possibly  re-allocatable  sensors  having  varying 
capabilities.  Analyze  the  results  in  such  a  way  as  to  supply  local  and  over-all 
assessments  of  the  significance  of  a  scenario  and  to  determine  proper  responses 
based  on  those  assessments  (Goodman  et  al.,  1997). 

As  noted  by  Goodman  et  al.  (1997),  this  definition  could  apply  to  the  concerns  of  other  areas 
such  as  pattern  recognition  and  related  aspects  of  artificial  intelligence,  e.g.,  those  described  in 
Duda  et  al.  (2001). 

In the typical military setting, there are various platforms with sensors of different types. From
such  information,  the  data  fusion  system  needs  to  produce  reasonable  hypotheses  concerning  the 
actual  truth  on  the  ground.  How  many  vehicles  are  there?  What  are  their  positions?  What  are 
their  velocities  and  directions  of  movements?  What  are  their  identities  and  military  status 
(enemy,  allied,  or  neutral)?  The  amount  of  data  is  vast  and  continuous.  Various  mathematical 
techniques  have  been  developed  to  deal  with  this  problem. 

These  techniques  include  the  Kalman  filter,  Multi-Hypothesis  Estimation  (MHE)  filter,  Joint 
Probabilistic  Data  Association  (JPDA)  filter,  the  Mori,  Chong,  Tse,  and  Wishner  (MCTW)  filter, 
Point-Process Filter, Hidden Markov Model (HMM) filters, and many others.
Additionally, the techniques include the use of Belief Functions (Dempster-Shafer Theory of
Evidence), random sets, event algebras, neural networks, and also Bayesian techniques
(Robert, 2001; Gelman et al., 2000). Simulation and optimization techniques are also used,
e.g., to determine sensor placement (Brown and Schaumburg, 2001). The literature is vast and a
detailed  comparison  of  the  various  techniques  cannot  be  given  here  (see  Hall,  1992;  Waltz  and 
Llinas,  1990;  Sadjadi,  1996;  Bar-Shalom  and  Fortmann,  1998;  Klein,  1993). 
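
As a rough illustration of the simplest of these techniques, the following sketch applies a scalar Kalman filter to fuse position readings from sensors of differing accuracy. It is a minimal example under stated assumptions, not a method taken from the cited works: the function name `kalman_fuse`, the random-walk state model, and the sample readings are all hypothetical.

```python
def kalman_fuse(readings, process_var=0.0, x0=0.0, p0=1e6):
    """Fuse scalar readings with a one-dimensional Kalman filter.

    readings    -- iterable of (measurement, measurement_variance) pairs,
                   possibly coming from different sensors (hypothetical data)
    process_var -- variance added at each step by the assumed random-walk model
    x0, p0      -- initial estimate and its variance (large p0 = little prior knowledge)
    Returns a list of (estimate, variance) pairs, one per reading.
    """
    x, p = x0, p0
    history = []
    for z, r in readings:
        # Predict: under a random-walk model the estimate is unchanged,
        # but its uncertainty grows by the process variance.
        p = p + process_var
        # Update: the Kalman gain weights the new reading by relative certainty.
        k = p / (p + r)
        x = x + k * (z - x)
        p = (1.0 - k) * p
        history.append((x, p))
    return history


# Hypothetical example: a coarse sensor (variance 4.0) and a finer one (variance 1.0)
# observe the same quantity.
estimate, variance = kalman_fuse([(10.0, 4.0), (12.0, 1.0)])[-1]
```

In this example the fused variance is smaller than that of either sensor alone, which is the sense in which multisensor fusion can be more accurate than any single sensor.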

Some  recent  efforts  towards  improving  these  techniques  by  adding  world/domain  knowledge  are 
due  to  Donald  Brown  and  his  collaborators  (Power  and  Brown,  2002;  Sobiesk  et  al.,  1998; 
Barker  et  al.,  1998;  Bovey  and  Brown,  2000).  The  types  of  knowledge  that  they  incorporate 
include  knowledge  of  the  environment  (e.g.,  terrain)  and  also  characteristics  of  the  entities 
operating  in  the  environment  (e.g.,  membership  in  a  group  such  as  a  convoy).  Improved 
performance results from incorporating this additional information. This work is significant in
that  it  forms  a  bridge  between  the  traditional  data  fusion  activities  and  the  higher-level 
techniques  to  be  discussed  next. 

Donald  Brown  and  his  collaborators  have  also  used  the  techniques  of  data  fusion  to  predict 
criminal  incidents  on  the  basis  of  past  occurrences  and  to  determine  which  crimes  are  likely  to 
have  been  committed  by  the  same  individuals  (Brown,  1998;  Liu  and  Brown,  1998;  Brown  and 
Hagen,  1999;  Brown  and  Oxford,  2001;  Xue  and  Brown,  2003).  Essential  to  this  work  is  a 
mathematical  measure  of  the  similarity  of  incidents.  Clearly  the  same  techniques  can  be  applied 
to  the  prediction  of  incidents  such  as  terrorist  attacks  or  military  strikes. 

A  general  model  of  data  fusion,  the  Joint  Directors  of  Labs  (JDL)  model,  is  laid  out  by  White 
(1988) and then revised by Steinberg et al. (1998). The latter paper defines data fusion very
generally  as  the  process  of  combining  data  to  refine  state  estimates  and  predictions.  A  number  of 
levels  of  data  fusion  are  proposed  by  Steinberg  et  al.  (1998)  to  categorize  the  different  types  of 
problems  addressed  in  data  fusion  work.  The  levels  are  as  follows: 

Level  0:  Sub-Object  Data  Assessment,  estimation  and  prediction  of  signal/object 
observable  states  on  the  basis  of  pixel/signal  level  data  association  and  characterization. 

Level 1: Object Assessment, estimation and prediction of entity states on the basis of
observation-to-track  association,  continuous  state  estimation,  and  discrete  state  estimation. 

Level  2:  Situation  Assessment,  estimation  and  prediction  of  relations  among  entities. 

Level  3:  Impact  Assessment,  estimation  and  prediction  of  effects  on  situations  of  planned 
or  estimated/predicted  actions  by  participants. 

Level  4:  Process  Refinement,  adaptive  data  acquisition  and  processing  to  support  mission 
objectives. 

Level 0 involves things like detecting signals and finding features in an image. Level 1 involves
the identification of physical objects, possibly by drawing upon other fusion processes, e.g.,
tracks. Groupings of these objects are postulated under level 2. Level 3 aggregates or relates the
groupings of entities from level 2 into plans and determines how they might affect the plans of
other parties. Finally, level 4 determines what actions should be taken, perhaps as part of a new
plan.  Steinberg  et  al.  (1998)  state  clearly  that  the  information  flow  may  go  in  multiple 
directions,  not  just  from  level  1  to  level  2  to  level  3  and  then  to  level  4.  Additionally,  they  point 
out  that  particular  applications  may  call  for  a  different  partition  of  functions. 

Clearly, the JDL architecture is motivated by the integration of sensor data. It may not be
immediately suitable for the integration of other sorts of information such as the contents of
databases. Additionally, the higher up one goes in levels (from 2 to 3, and certainly to 4), the
less applicable are the traditional mathematical methods of Data Fusion. Yet the overall goals
are not very different from those of Knowledge Fusion. The model seems to call for the
techniques that will be discussed in the remainder of this report.

4.2  Information  Integration 

The goal of work in information/data integration is to provide uniform access to multiple sources
of  information.  The  idea  is  that  the  user  should  only  have  to  focus  on  specifying  what  is  wanted 
and  to  let  the  system  worry  about  how  to  obtain  the  answers  (Levy,  1998,  1999;  Levy  and 
Weld,  2000).  Although  initial  work  in  this  area  considered  only  databases,  the  work  was  quickly 
generalized  to  include  web  sources  as  well. 

Information/data  integration  work  has  been  primarily  carried  out  by  database  researchers.  An 
introductory  presentation  of  issues  of  information  integration  may  be  found  in  Chapter  20  of 
Garcia-Molina et al. (2002). Surveys of earlier work in this area include Batini et al. (1986) and
the  more  recent  work  of  Hull  (1997).  Wiederhold  (1992)  introduced  the  important  notion  of 
mediators,  i.e.,  modules  that  mediate  between  the  users’  workstations  and  data  sources,  using 
knowledge to transform data into information.

The  literature  of  the  past  decade  describes  a  number  of  projects  in  this  area: 

1. TSIMMIS (Garcia-Molina et al., 1997) is an information integration architecture based on
mediators. The architecture includes the following:

a.  The  OEM  data  model,  an  object  oriented  data  model  that  is  self  describing  (Nestorov 
et  al.,  1997;  Papakonstantinou  et  al.,  1995).  It  is  especially  suitable  for  semistructured 
hierarchical  data  such  as  that  found  on  the  World  Wide  Web. 

b.  Mediators  and  wrappers  and  tools  for  generating  them.  A  system  for  declaratively 
specifying  mediators  has  been  developed  (Papakonstantinou  et  al.,  1996). 
Papakonstantinou  et  al.  (1995)  describe  a  toolkit  for  the  development  of  wrappers 
(software  modules  that  extract  data  from  a  source). 

c.  A  query  language,  LOREL,  especially  designed  to  query  semistructured  data  or  data 
from  several  heterogeneous  data  sources.  The  queries  can  operate  over  data  having 
different  types  and  also  when  some  of  the  data  is  absent  (Abiteboul  et  al.,  1997). 

The approach used in TSIMMIS has been described as “Global As View” (GAV), also termed
query-centric, because the mediated or global schema is described as a set of database views
(i.e., queries) over the different information sources. The amount of computation needed to
answer a query is thus minimized, but it is more difficult to add, delete, or modify information
sources.

2. SIMS: The SIMS project takes a planning approach to the problem
(Arens et al., 1993, 1996). Given a query, the planner considers alternative ways to find
the data required to satisfy the query.

3. InfoMaster: The approach used here is “Local As View” (LAV). For every data source,
it is necessary to describe the information in the source by writing a rule over the relations
in  the  mediated  (global)  schema  (Duschka  and  Genesereth,  1997a,  1997b).  Query 
reformulation  is  therefore  more  difficult  and  requires  a  certain  amount  of  inferencing.  But 
adding  additional  data  sources  is  easier. 

4. InfoQuilt (Thacker et al., 2003) is a platform for answering complex information requests
over data available on the web. Ontologies (see section 4.3) are used to describe
knowledge about domains and the relationships between them. InfoQuilt uses the LAV
approach for representing the information contained by various sources. It has a number of
tools  to  facilitate  the  construction  of  the  knowledge  base  by  the  user. 

5. HERMES: The HERMES system, developed by V. S. Subrahmanian and collaborators,
provides a particularly rich language for writing mediators (Subrahmanian et al., 1995;
Subrahmanian, 1994; Adali and Subrahmanian, 1994, 1996; Lu et al., 1995, 1996;
Adali and Emery, 1995). The language is based upon Annotated Logic Programming due
to Kifer and Subrahmanian (1992).

6. Information Manifold (Levy et al., 1996a, 1996b; Kirk et al., 1995): This work is
designed to provide an interface to structured information sources available on the internet.
The sources seem to be generally databases. The contents of sources are described in a
language called CARIN (Levy and Rousset, 1996, 1998) that combines Datalog and a
description  logic  (see  section  4.3).  The  system  computes  a  query  plan  to  answer  a 
particular  query  given  the  description  of  the  available  information  sources.  Probabilistic 
information  on  the  coverage  of  data  sources  has  been  added  (Florescu  et  al.,  1997). 

7. Nimble is a commercial system based upon XML (Draper et al., 2001)
(see also http://www.nimble.com).

8.  Tukwila:  The  Tukwila  information  integration  system  is  designed  to  handle  XML  data  and 
in  particular  streaming  data.  As  such,  Tukwila  can  provide  answers  while  the  data  are 
continuously  coming  (Ives  et  al.,  1999,  2000a,  2000b,  2001,  2003). 

9.  Softbot  Interface  to  Internet:  Etzioni  and  Weld  (1994)  describe  a  softbot  interface  to  the 
internet which makes use of AI planning methods to satisfy the user’s queries. Etzioni
(1996)  covers  general  issues  concerning  mining  the  Internet.  Theoretical  issues  pertaining 
to  constructing  plans  for  information  gathering  when  there  are  costs  (monetary  and 
temporal) on accessing different information sources are investigated by Etzioni et al.
(1996).

10.  Ariadne  (Knoblock  et  al.,  2002a)  is  a  system  for  extracting  and  integrating  data  from  web 
sources. It allows users to rapidly create information agents for the web.
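
The “Global As View” idea mentioned for TSIMMIS can be made concrete with a small sketch. The example below is only an illustration under stated assumptions: the two sources, their field names, the global relation `vehicle_sighting`, and the helper `query_locations` are all hypothetical and do not come from any of the systems listed.

```python
# Two hypothetical information sources with different local schemas.
radar_reports = [
    {"track_id": "T1", "grid": "NK1234", "kind": "tank"},
    {"track_id": "T2", "grid": "NK1240", "kind": "truck"},
]
scout_reports = [
    {"unit_seen": "truck", "location": "NK1240"},
    {"unit_seen": "tank", "location": "NK1302"},
]


def vehicle_sighting():
    """GAV definition: the global relation vehicle_sighting(kind, location)
    is written directly as a view (a query) over the sources."""
    rows = {(r["kind"], r["grid"]) for r in radar_reports}
    rows |= {(s["unit_seen"], s["location"]) for s in scout_reports}
    return rows


def query_locations(kind):
    """A user query posed against the global schema only.
    Answering it amounts to unfolding the view definition above."""
    return sorted(loc for k, loc in vehicle_sighting() if k == kind)


tank_locations = query_locations("tank")
```

Because the global relation is defined in terms of the sources, answering a query is a simple unfolding, but adding a third source means editing `vehicle_sighting` itself. Under the LAV approach each source would instead be described separately in terms of the global schema, which makes adding sources easier at the price of harder query reformulation.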

The projects described above range from experimental to commercial and illustrate a variety
of techniques. Separately from these particular projects, there has been considerable work
on general issues (both theoretical and practical or algorithmic) that arise in the area of
information integration. These topics are surveyed in the following paragraphs.

Hull  (1997)  and  Ullman  (1997)  survey  approaches  towards  the  integration  of  information  from 
multiple  databases.  General  work  on  the  use  of  description  logics  for  information  integration  has 
been  carried  out  by  Lenzerini  and  collaborators  (Calvanese  et  al.,  1997a,  1998a,  1998c,  2002). 
They  developed  a  description  logic  (along  with  associated  inference  methods)  that  is  suitable  for 
representing  the  various  information  sources  as  well  as  the  global  or  mediated  domain.  The  logic 
can  express  the  relationship  between  concepts  from  the  different  domains.  Other  applications 
are  data  warehousing  (Calvanese  et  al.,  1998b),  handling  SGML  documents 
(Calvanese  et  al.,  1997b),  and  extensions  of  object-oriented  modeling  (Calvanese  et  al.,  1995). 
Additional general work on the use of description logics in information integration has also been
carried  out  by  Beeri  et  al.  (1997). 

General  issues  concerning  query-answering  plans  for  data  integration  systems  are  considered  by 
Duschka et al. (2000). In particular, they look at the notion of a maximally contained plan, a
plan  that  provides  all  the  answers  that  are  possible  to  obtain  from  the  available  sources,  but 
which  may  not  necessarily  be  equivalent  to  the  original  query.  The  authors  consider  the  cases 
when  recursive  plans  are  needed. 

Friedman  et  al.  (1999)  discuss  the  use  of  planning  techniques  to  gather  information.  The  user’s 
queries  are  translated  into  plans  over  information  sources.  The  efficiency  of  information 
gathering plans and techniques for transforming plans into more efficient plans are considered in
a  number  of  papers  (Friedman  and  Weld,  1997;  Doan  and  HaLevy,  2002).  Issues  pertaining  to 
query  containment  are  considered  by  Milstein  et  al.  (2002).  Additionally,  work  at  ISI  has 
addressed  the  issue  of  planning  queries  in  the  presence  of  mediators  that  provide  access  to 
information  distributed  over  heterogeneous  sources  (Ambite  and  Knoblock,  2000). 

The  question  of  optimizing  query  plans  with  regard  to  cost  and  coverage  has  been  considered  by 
a  number  of  researchers.  Nie  and  Kambhampati  (2001)  consider  cost  in  terms  of  planning  cost. 
This is traded off against the coverage of the plan. Kambhampati and his collaborators have also looked
at  utilizing  machine-learning  techniques  for  gathering  coverage  statistics  (Nie  and  Kambhampati, 
2003;  Nie  et  al.,  2002). 
In a number of publications, Bertossi, Chomicki, and collaborators look at the computation of
consistent answers from inconsistent databases (Bertossi and Chomicki, 2003; Bravo and
Bertossi, 2003; Arenas et al., 2000, 2003; Bertossi et al., 2002). They are particularly concerned
with  databases  that  violate  integrity  constraints.  The  approach  is  to  compute  the  minimal  repairs 
that  could  restore  consistency  to  the  database(s)  and  reason  over  this  new  set  of  databases. 
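
The idea of reasoning over repairs can be sketched for the simplest case, a single key constraint. The example below is an assumption-laden illustration, not the algorithm of the cited papers (which handle more general integrity constraints and avoid enumerating repairs): the relation, its contents, and the helper names `repairs` and `consistent_answers` are hypothetical.

```python
from itertools import product


def repairs(rows, key_index=0):
    """Enumerate the minimal repairs of a relation that violates a key constraint.

    Rows sharing a key value conflict with each other; a minimal repair keeps
    exactly one row from each group of conflicting rows.
    """
    groups = {}
    for row in rows:
        groups.setdefault(row[key_index], []).append(row)
    return [set(choice) for choice in product(*groups.values())]


def consistent_answers(rows, query, key_index=0):
    """Return the answers to `query` that hold in every repair."""
    answer_sets = [set(query(rep)) for rep in repairs(rows, key_index)]
    return set.intersection(*answer_sets)


# Hypothetical relation unit_position(unit, position) with key `unit`.
# Two sources disagree about the position of "A Co".
unit_position = [
    ("A Co", "Hill 12"),
    ("A Co", "Hill 14"),
    ("B Co", "Bridge 3"),
]

certain_rows = consistent_answers(unit_position, lambda rep: rep)
certain_units = consistent_answers(unit_position, lambda rep: {u for u, _ in rep})
```

Only the position of "B Co" is a consistent answer, since the two repairs disagree about "A Co"; the fact that "A Co" has some reported position, however, holds in both repairs and is therefore consistent.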

Machine  learning  techniques  have  been  proposed  for  learning  many  of  the  pieces  of  knowledge 
needed  by  information  integration  systems.  Examples  are  learning  the  mapping  between  source 
and mediated schema (Doan et al., 2000), learning mappings between ontologies
(Doan et al., 2002), and more generally using statistical methods for finding mappings between
different  schemas  (HaLevy  and  Madhavan,  2003).  Kushmerick  (2000)  analyzes  a  number  of 
techniques  for  automatically  inducing  wrappers  for  web  sites.  Craven  et  al.  (2000)  developed  a 
trainable system that extracts information from the web and produces a knowledge base.

Cohen  (2000)  describes  a  language  called  WHIRL  that  combines  both  textual  (concepts  based  on 
information  retrieval  [I.R.])  and  logic-based  representation  systems  (e.g.,  datalog)  into  a  single 
language  for  information  representation.  The  language  is  useful  for  a  variety  of  information 
integration  tasks  including  the  generation  of  wrappers. 
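
The general idea of combining information retrieval concepts with relational operations can be illustrated by a join that matches records on textual similarity rather than exact equality. WHIRL itself is not reproduced here; the sketch below is a simplified stand-in that uses TF-IDF weighted cosine similarity, and the function names, the weighting formula, the threshold, and the sample strings are all assumptions made for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(texts):
    """Turn each text into a unit-length TF-IDF vector (dict of term -> weight)."""
    docs = [Counter(t.lower().split()) for t in texts]
    n = len(docs)
    # Document frequency: in how many texts does each term occur?
    df = Counter(term for d in docs for term in d)
    vectors = []
    for d in docs:
        vec = {t: tf * math.log(1 + n / df[t]) for t, tf in d.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors


def similarity_join(left, right, threshold=0.5):
    """Pair up strings from two sources whose cosine similarity meets the threshold."""
    vectors = tfidf_vectors(left + right)
    left_vecs, right_vecs = vectors[:len(left)], vectors[len(left):]
    pairs = []
    for a, va in zip(left, left_vecs):
        for b, vb in zip(right, right_vecs):
            score = sum(w * vb.get(t, 0.0) for t, w in va.items())
            if score >= threshold:
                pairs.append((a, b, score))
    return sorted(pairs, key=lambda p: -p[2])


# Hypothetical names for the same entity as recorded by two different sources.
source_a = ["3rd Armored Brigade", "Northern supply depot"]
source_b = ["armored brigade 3rd", "Coastal radar station"]
matches = similarity_join(source_a, source_b)
```

A join of this kind lets sources be integrated even when they do not share a common set of identifiers, since names that differ in word order or capitalization can still be matched.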

Work  in  the  area  of  information  integration  often  makes  use  of  the  notions  of  mediators  and 
wrappers.  These  are  in  some  respect  similar  to  the  notion  of  agents.  There  are  different  pieces 
of  software  serving  particular  purposes,  but  it  is  a  relatively  passive  model.  The  mediators  and 
wrappers  do  not  monitor  their  environment  and  take  appropriate  actions  given  their  goals.  For 
time  varying  data,  it  seems  likely  that  the  more  active  model  of  agents  would  be  appropriate 
rather  than  the  more  passive  model  of  mediators  and  wrappers.  For  keeping  the  COP  up  to  date, 
sources  will  have  to  notify  the  COP  of  significant  changes.  How  they  would  determine  which 
changes  are  significant  is  an  open  question  at  this  point.  There  may  have  to  be  extensive 
communication  going  on  in  both  directions  between  the  CROP  and  the  wrappers. 

4.3  Ontologies  and  the  Semantic  Web 

This section covers several related areas. There has been an interest in ontologies,
machine-understandable specifications of a conceptualization of a domain (Gruber, 1993), for
some time in the area of artificial intelligence (AI)/knowledge-based systems. A number of
languages have been developed to represent ontologies. The effort known as the semantic web
has moved this work onto the web (using XML) so that the terminologies defined by these
ontologies  can  now  categorize  data  available  on  the  web. 

4.3.1  Ontologies 

Overviews  of  recent  work  on  ontologies  have  been  written  by  Fensel  (2001)  and  McGuinness 
(2003).  For  a  more  philosophical  perspective,  see  Guarino  (1998).  Of  particular  recent 
importance  is  the  convergence  between  web  technology  and  ontologies.  This  convergence  has 
resulted in efforts to create ontology languages based upon XML and RDF. The construction
and manipulation of ontologies are crucial to the development of the information system to
support NCW and the Future Force. Ontologies provide a formal specification of the terminology used
to  categorize  objects  and  events.  Without  such  ontologies,  it  is  simply  not  possible  to 
automatically  manipulate  the  terminologies  so  defined  for  fusion  tasks. 

4.3.2  Knowledge  Representation  Systems/Languages 

There  are  a  number  of  knowledge  representation  languages  that  have  been  developed  to 
represent  ontologies.  Some  of  the  better  known  are  as  follows: 

1. Ontolingua is a frame-based language designed to facilitate the use of ontologies within
multiple  applications. 

2.  Loom  was  developed  at  ISI  and  is  based  on  a  description  logic  or  classifier  system. 

3. CYC: See Reed and Lenat (2002) for a recent discussion of applications of CYC. See Lenat and
Guha (1989) for the original description of the project. The CYC project includes the
development of an ontology language, a set of tools, and also a large ontology designed to
be used in a wide variety of applications.

4.  GKB  (Generic  Knowledge  Base)  Editor  (developed  at  SRI)  is  a  graphical  tool  for 
interactive knowledge base browsing and editing (Karp et al., 1998).

Karp  (1993)  has  produced  a  (now  somewhat  dated)  comparison  of  many  of  these  systems. 

Much  work  in  the  area  of  ontologies  is  based  upon  frame  languages.  A  more  flexible  alternative 
that  automatically  computes  the  classification  of  instances  is  description  logics.  The  literature  on 
description  logics  is  quite  large.  See  Calvanese  et  al.  (1998)  for  an  overview  of  their  use  in 
modeling.  A  book  length  overview  of  the  entire  area  of  description  logic  research  is  now 
available (Baader et al., 2003).

Some  work,  especially  at  ISI,  has  been  done  on  developing  tools  to  assist  in  the  gathering  of 
knowledge  (Blythe  et  al.,  2001;  Blythe,  2001;  Gil  and  Ratnakar,  2002).  Some  of  the  projects  are 
as  follows: 

1.  Trellis  (Trellis,  Information  Sciences  Institute,  USC  [2004];  Gil,  2003). 

2.  Expect  (Expect,  Information  Sciences  Institute,  USC  [2004]). 

3. Temple (Temple, Information Sciences Institute, USC [2004]).

4.3.3  Markup  Languages 

Markup  languages  such  as  HTML,  XML,  XML  Schema,  and  XHTML  have  become  ubiquitous 
today.  Standards  for  such  languages  are  set  by  the  W3C  organization  (http://www.w3c.org). 
There  are  numerous  books  describing  XML  (including  its  extensions  such  as  XML  Schema  and 
XHTML)  and  applications  to  electronic  commerce  (Martin  et  al.,  2000;  Schmelzer  et  al.,  2002). 
4.3.4 Semantic Web

The  network-centric  environment  for  Army  knowledge  fusion  will  be  based  upon  the  semantic 
web.  Some  general  introductions  are  Berners-Lee  et  al.  (2002)  and  Fensel  (2003a,  2003b).  RDF 
(Lassila  and  Swick,  1999)  is  a  basic  ontology  language  based  upon  XML.  Many  of  the  semantic 
web  languages  are  based  upon  RDF.  See  Patel-Schneider  and  Fensel  (2002)  for  a  discussion  of 
various  issues  related  to  the  layering  of  languages  that  make  up  the  semantic  web. 
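
RDF describes resources through statements that each consist of a subject, a predicate, and an object. The following sketch shows how such statements can be held and queried as plain triples. It is a toy illustration only: the `ex:` identifiers, the sample facts, and the helper `match` are hypothetical, and no RDF library or serialization syntax is involved.

```python
# Each RDF statement is a (subject, predicate, object) triple.
# The "ex:" names below are made-up identifiers for illustration.
triples = [
    ("ex:convoy7", "rdf:type", "ex:Convoy"),
    ("ex:convoy7", "ex:hasMember", "ex:vehicle12"),
    ("ex:vehicle12", "rdf:type", "ex:Truck"),
    ("ex:vehicle12", "ex:observedAt", "ex:siteAlpha"),
]


def match(store, s=None, p=None, o=None):
    """Return all triples matching a pattern; None acts as a wildcard."""
    return [
        t for t in store
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


# Which resources are typed as convoys, and what do we know about vehicle12?
convoys = [s for s, _, _ in match(triples, p="rdf:type", o="ex:Convoy")]
about_vehicle = match(triples, s="ex:vehicle12")
```

Ontology languages layered on top of RDF add vocabulary for classes and properties, so that terms such as the hypothetical `ex:Convoy` above can be given a shared, machine-interpretable meaning.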

Heflin  et  al.  (2003)  discuss  the  web  markup  language  SHOE  (not  based  on  RDF).  McGuinness 
et  al.  (2003)  describe  the  language  DAML-ONT,  which  is  the  ontology  portion  of  DAML 
(DARPA  Agent  Markup  Language).  The  language  OIL  is  described  by  Klein  et  al.  (2003). 

Both  OIL  and  the  combination  of  DAML  and  OIL  (DAML+OIL)  are  discussed  by 
Fensel  et  al.  (2003b).  A  more  recent  language  being  considered  is  OWL  (Smith,  2003).  All  of 
these  languages  are  based  on  RDF  and  have  been  considered  by  the  W3C.  OWL  is  currently  a 
W3C  candidate  recommendation. 

Much  work  in  the  semantic  web  is  devoted  to  the  development  of  ontologies,  ontology 
languages,  and  support  tools  for  the  specification  of  the  information  available  on  web  pages  and 
also  for  appropriately  making  web  services  available  to  automated  agents.  For  discussions  of 
various  tools  see  (Omelayenko  et  al.,  1993;  Sure  and  Studer,  2003;  Klein  et  al.,  2003;  Engels  and 
Lech,  2003;  Sure  et  al.,  2003).  Lassila  and  Adler  (1993)  discuss  how  web  services  (semantically 
specified)  may  be  utilized  by  a  hand-held  (PDA-like)  device.  A  similar  device  may  be  utilized 
by  soldiers  in  the  field  to  access  services  available  on  the  network  (GIG). 

4.3.5  Intelligent  Search  and  Retrieval 

Maedche  et  al.  (2003)  describe  techniques  for  semantically  querying  and  navigating  repositories 
of  materials  annotated  with  a  semantic  markup  language.  There  has  also  been  work  on  the 
storage  and  retrieval  of  RDF  annotated  data.  See,  for  example,  Davies  et  al.  (2003)  and  also 
Broekstra  et  al.  (2003). 

Service  retrieval  (e.g.,  online  web  services)  is  an  important  topic  that  is  likely  to  play  a  major 
role  in  the  information  system  needed  to  support  the  Army’s  Future  Force.  One  approach,  due  to 
Klein and Bernstein, is based upon the use of an ontology of processes (Klein and Bernstein,
2001;  Bernstein  and  Klein,  2002).  The  work  makes  use  of  an  approach  to  modeling  processes 
that  has  associated  with  it  a  number  of  tools  and  other  applications  (Klein  and  Dellarocas,  2000; 
Bernstein  et  al.,  1999;  Klein,  2003). 

4.3.6  Web-Site  Management 

A  tremendous  amount  of  information  will  be  available  throughout  the  GIG  for  inclusion  into  the 
CROP.  Automating  the  process  of  gathering  the  relevant  data  for  inclusion  is  a  major  issue. 
There has been some relevant work on web-site management. In particular, two systems,
Strudel and Tiramisu, have been developed (Fernandez et al., 1997a, 1997b, 2000;
Anderson et al., 1999). These allow the site designer to specify in a declarative fashion what
data  should  be  included  in  the  website  and  how  that  data  should  be  visually  presented. 
4.4 Advanced Representation and Reasoning

This  section  covers  a  number  of  topics  related  to  representation  and  reasoning  that  are  likely  to 
be  essential  to  the  success  of  NCW.  One  is  the  development  of  a  flexible  notion  of  plans.  This 
has  been  mentioned  earlier.  Additionally,  it  is  necessary  to  recognize  the  plan  of  enemy  forces 
given  the  evidence  available.  This  problem,  essential  to  the  vision  of  NCW,  has  been 
investigated  under  the  name  of  plan  recognition  and  is  a  specific  application  of  abductive
reasoning.  Other  types  of  relevant  reasoning  are  logical  reasoning  such  as  that  found  in  the 
literature  on  logic  programming,  and  also  probabilistic  or  uncertain  reasoning.  Finally,  real-time 
behavior  is  likely  to  be  important  in  many  cases,  while  these  reasoning  methods  often  do  not 
exhibit  real-time  behavior.  Work  in  the  area  of  real-time  problem  solving  addresses  the  issue  of 
obtaining  responses  in  a  time-constrained  setting. 

4.4.1  Plan  Representation 

The  literature  on  planning  within  AI  is  vast.  See  Allen  et  al.  (1991)  for  several
innovative  articles  in  this  area.  Devanbu  and  Litman  (1996)  provide  a  description
logic  (CLASP)  to  represent  plans.  In  CLASP,  the  notion  of  subsumption  applies  between  plan
concepts  and  also  between  plan  concepts  and  plan  instances.  This  work  may  be  useful  in 
addressing  the  need  for  a  flexible  representation  for  plans  and  activities. 

GoLog  (Levesque  et  al.,  1997;  Reiter,  2001)  is  a  high-level  programming  language  for  agents 
based  on  the  situation  calculus.  The  language  includes  nondeterministic  constructs  and 
incorporates  deduction  into  the  planning  mechanism.  Again,  the  work  may  prove  useful  as  a 
basis  for  a  flexible  plan  representation  language. 

4.4.2  Plan  Recognition 

There  is  a  body  of  literature  on  plan  recognition  and  more  generally  on  abductive  reasoning. 
Here,  rather  than  constructing  a  plan  and  executing  it,  the  goal  is  to  determine,  on  the  basis  of  the
available  evidence,  which  plan  is  being  carried  out  (for  example,  by  enemy  forces).  This
work  will  enable  the  COP  to  present  information  on  the  likely  intent  and  future  positions  of  the 
enemy  forces. 

An  important  early  work  on  plan  recognition  is  that  of  Kautz  (1991),  in  which  a  formal  theory  of
plan  recognition  is  developed  and  the  problem  is  precisely  stated.  In  this  work,  the  goal  of
plan  recognition  is  to  produce  the  most  specific  plan  that  accounts  for  the  observations  alone.
Therefore,  it  does  not  seem  amenable  to  the  incorporation  of  other  types  of
evidence  in  support  of  the  likelihood  of  alternative  plans. 

Charniak  and  Goldman  (1993)  present  a  method  of  plan  recognition  in  which  candidate
explanations  (i.e.,  plans)  are  first  retrieved  and  then  are  assembled  into  a  Bayesian  network.
Then  the  observed  actions  are  entered  and  Bayesian  updating  takes  place.  The  result  is  a
probabilistic  ranking  of  candidate  plans  on  the  basis  of  their  likelihood  of  being  the  actual  plan.
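
The ranking step can be illustrated with a small sketch. It is not the network construction of Charniak and Goldman; it assumes a much simpler model in which observed actions are conditionally independent given the plan, and the plan names, action names, and probabilities are hypothetical values chosen only for illustration.

```python
def rank_plans(priors, likelihoods, observed_actions):
    """Return candidate plans sorted by posterior probability.

    priors: dict plan -> P(plan)
    likelihoods: dict plan -> dict action -> P(action | plan)
    observed_actions: list of observed action names
    Assumption: actions are conditionally independent given the plan.
    """
    unnormalized = {}
    for plan, prior in priors.items():
        score = prior
        for action in observed_actions:
            # Small floor so an unmodeled action does not zero out a plan.
            score *= likelihoods[plan].get(action, 1e-6)
        unnormalized[plan] = score
    total = sum(unnormalized.values())
    posterior = {plan: s / total for plan, s in unnormalized.items()}
    return sorted(posterior.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical candidate plans and numbers, for illustration only.
PRIORS = {"flank_attack": 0.3, "frontal_attack": 0.5, "withdrawal": 0.2}
LIKELIHOODS = {
    "flank_attack": {"move_east": 0.7, "bridge_recon": 0.6},
    "frontal_attack": {"move_east": 0.2, "bridge_recon": 0.3},
    "withdrawal": {"move_east": 0.1, "bridge_recon": 0.05},
}

ranking = rank_plans(PRIORS, LIKELIHOODS, ["move_east", "bridge_recon"])
for plan, probability in ranking:
    print(f"{plan}: {probability:.3f}")
```

Although the frontal attack has the highest prior in this made-up example, the two observed actions shift most of the probability mass to the flanking plan.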

Poole  (1993)  presents  a  framework  for  the  abduction  of  Horn  clauses,  based  on  the  combination
of  a  Bayesian  network  with  Prolog.  In  Goldman  et  al.  (1999),  Poole’s  work  is  used  as  the  basis 
for  an  approach  to  plan  recognition.  These  approaches  seem  to  have  the  flexibility  to  be 
extended  to  allow  a  variety  of  types  of  evidence  to  affect  the  likelihood  ranking  of  plans. 

Closely  related  to  plan  recognition  is  the  abstraction  of  higher  level  concepts  from  temporal  data. 
Shahar  (1997)  gives  a  general  knowledge-based  framework  for  such  a  temporal  abstraction 
process.  Another  related  topic  is  work  on  finding  patterns  in  temporal  data  or  the  dividing  of 
sequences  into  meaningful  episodes  (Oates  et  al.,  1997;  Cohen,  2001;  Cohen  and  Adams,  2001). 

4.4.3  Abduction  and  Belief  Revision 

Plan  recognition  as  discussed  in  the  previous  section  is  closely  related  to  the  more  general  topic 
of  abductive  reasoning.  The  problem  is  to  infer  what  must  have  been  true  to  account  for  current 
observations.  Especially  relevant  is  abduction  in  the  context  of  the  occurrence  of  actions.  Some 
relevant  literature  is  Baral  (2000),  Poole  (1989),  and  Selman  and  Levesque  (1996). 
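
A minimal sketch of abduction over propositional Horn clauses is given here to make the problem statement concrete. It is a brute-force search for the smallest sets of assumable hypotheses that entail the observations, not the method of any of the cited papers; the rule base and atom names are hypothetical.

```python
from itertools import combinations


def abduce(rules, hypotheses, observations):
    """Return the smallest sets of hypotheses that explain all observations.

    rules: list of (body, head) Horn clauses; body is a frozenset of atoms.
    hypotheses: atoms that may be assumed true.
    observations: atoms that must be derivable from the assumed atoms.
    """
    def closure(facts):
        # Forward chaining: apply rules until nothing new is derived.
        derived = set(facts)
        changed = True
        while changed:
            changed = False
            for body, head in rules:
                if body <= derived and head not in derived:
                    derived.add(head)
                    changed = True
        return derived

    candidates = sorted(hypotheses)
    for size in range(len(candidates) + 1):
        found = [set(combo) for combo in combinations(candidates, size)
                 if set(observations) <= closure(combo)]
        if found:
            return found  # all explanations of minimum size
    return []


# Hypothetical rule base, for illustration only.
RULES = [
    (frozenset({"convoy_moved"}), "tracks_on_road"),
    (frozenset({"convoy_moved"}), "dust_cloud"),
    (frozenset({"civilian_traffic"}), "tracks_on_road"),
    (frozenset({"windstorm"}), "dust_cloud"),
]
explanations = abduce(
    RULES,
    {"convoy_moved", "civilian_traffic", "windstorm"},
    {"tracks_on_road", "dust_cloud"},
)
print(explanations)
```

Preferring the smallest explanation is only one possible selection criterion; probabilistic approaches such as those discussed in section 4.4.2 rank explanations by likelihood instead.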

Josephson  and  Chandrasekaran  (2003)  have  proposed  an  architecture  for  an  information  system 
in  which  abductive  inference  plays  a  major  role. 

Another  related  area  is  belief  revision  (Gardenfors,  1988,  1992;  Goldszmidt  and  Pearl,  1996; 
Boutilier,  1998;  Friedman  and  Halpern,  1997).  The  issue  here  is  how  to  revise  our  beliefs  given 
new  information.  It  is  often  necessary  to  eliminate  beliefs  to  accommodate  the  new  information,
and  there  may  be  different  possible  choices  that  can  be  made  with  regard  to  which  beliefs  to 
eliminate. 

4.4.4  Logic  Programming 

The  literature  on  logic  programming  is  quite  large.  Applications  include  abduction, 
nonmonotonic  reasoning,  planning,  reasoning  about  actions,  and  various  forms  of  problem 
solving.  Baral  and  Gelfond  (1994)  survey  the  use  of  logic  programming  for  knowledge 
representation.  The  recent  book  by  Baral  (2003)  covers  a  newer  logic  programming  language,
AnsProlog,  which  is  especially  suitable  for  nonmonotonic  reasoning.  Reasoning  about  actions
within  logic  programming  languages  has  been  a  topic  of  major  interest.  Baral,  Gelfond,  and 
collaborators  have  proposed  a  variety  of  action  description  languages  with  differing  capabilities 
(Baral  et  al.,  1997).  An  alternative  approach  is  the  use  of  GoLog  (Levesque  et  al.,  1997;  Reiter, 
2001),  mentioned  earlier. 

4.4.5  Uncertain  Reasoning 

The  literature  on  uncertain  or  probabilistic  reasoning  is  quite  large.  Many  of  the  techniques  are 
related  to  those  used  in  data  fusion  as  discussed  in  section  4.1.  Shafer  and  Pearl  (1990)  is  a 
collection  of  classic  papers  on  the  topic  of  uncertain  reasoning  in  general.  Pearl  (1998)  is  a 
book-length  introduction  to  the  combination  of  graphical  models  and  Bayesian  reasoning  known 
as  Bayesian  Networks.  The  works  of  Cowell  et  al.  (1999)  and  Jensen  (1996)  are  excellent
introductions  to  probabilistic  expert  systems.  Pearl  (2000)  is  a  book  devoted  to  exploring  the
notion  of  causality,  but  in  the  process  lays  out  the  latest  thinking  of  the  author  on  Bayesian 
networks  and  probabilistic  reasoning  in  general. 

The  use  of  machine  learning  techniques  to  learn  a  type  of  Bayesian  network  (Tree  Augmented
Naive  Bayes)  useful  for  classification  problems  is  discussed  by  Friedman  et  al.  (1997).  Issues  of 
using  Bayesian  networks  to  cluster  dynamic  processes  are  considered  by  Ramoni  et  al.  (2001). 
These  techniques  are  potentially  useful  in  computing  the  most  probable  generating  processes  for 
some  observed  sensor  data  when  that  sensor  data  is  continuous.  Again,  these  techniques  would 
supplement  those  traditionally  used  in  data  fusion. 

In  a  number  of  papers,  A.  Pfeffer,  D.  Koller,  and  collaborators  combine  Bayesian  Networks  with
the  types  of  more  expressive  structures  found  in  knowledge  representation  languages  (Koller
et  al.,  1997;  Pfeffer  et  al.,  1999;  Koller  and  Pfeffer,  1997,  1998;  Pfeffer  and  Koller,  2000).  The
result  is  probabilistic  versions  of  frame-based  languages  and  description  logics.  This  approach  is 
designed  to  be  suitable  for  representing  the  battlespace  domain.  A  related  approach,  but  with  an 
added  temporal  component,  has  been  proposed  recently  by  Sanghai  et  al.  (2003).  These  methods 
may  turn  out  to  be  important  in  integrating  the  results  of  data  fusion  with  higher  level  reasoning 
both  of  a  probabilistic  and  a  logical  nature. 

4.4.6  Real-Time  Problem  Solving 

Because  certain  computations  will  have  to  provide  an  answer  within  a  limited  time  period, 
computing  in  time-constrained  environments  is  of  importance.  Unfortunately,  many  of  the 
reasoning  methods  discussed  earlier  do  not  necessarily  have  good  time-constrained  behavior.
Techniques  for  performing  reasoning  under  time  constraints  are  therefore  likely  to  be  important  in
developing  the  reasoning  methods  needed  to  support  those  aspects  of  Army  knowledge
fusion  that  will  demand  real-time  behavior. 

A  classic  paper  is  that  of  Boddy  and  Dean  (1994),  in  which  anytime  algorithms  are  discussed.  An
anytime  algorithm  can  provide  an  answer  at  any  point  in  the  computation,  and  with  more  time  the
quality  of  the  answer  will  only  improve.  The  composition  of  real-time  modules  is  discussed  by
Zilberstein  and  Russell  (1996).  Other  aspects  are  considered  in  Hansen  and  Zilberstein  (2001)
and  Horvitz  (2001).  Real-time  diagnosis  and  the  construction  of  plans  to  discriminate  among 
possible  hypotheses  are  discussed  in  Ash  and  Hayes-Roth  (1996). 
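
The defining property of an anytime algorithm (an answer is always available, and its quality never decreases with more time) can be sketched as follows. This is a generic best-so-far scan under a deadline, offered as an illustration rather than as the formulation of Boddy and Dean; the function name, the candidate courses of action, and the scores are hypothetical.

```python
import time


def anytime_best(candidates, score, deadline_seconds):
    """Scan candidates until the deadline; return (best, best_score, examined).

    At least one candidate is always examined, so an answer is available
    however short the deadline. The best score never decreases as more
    candidates are examined.
    """
    start = time.monotonic()
    best, best_score, examined = None, float("-inf"), 0
    for candidate in candidates:
        if examined > 0 and time.monotonic() - start >= deadline_seconds:
            break  # Interrupted: return the best answer found so far.
        value = score(candidate)
        examined += 1
        if value > best_score:
            best, best_score = candidate, value
    return best, best_score, examined


# Hypothetical courses of action with made-up utility scores.
UTILITY = {"hold": 0.4, "advance": 0.7, "flank": 0.9, "withdraw": 0.2}

quick = anytime_best(list(UTILITY), UTILITY.get, deadline_seconds=0.0)
full = anytime_best(list(UTILITY), UTILITY.get, deadline_seconds=60.0)
print(quick)
print(full)
```

With a zero deadline the caller still receives the first candidate examined; with a generous deadline the scan completes and returns the best candidate overall.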

4.5  Contemporary  Database  Topics 

A  number  of  currently  active  research  areas  within  the  field  of  databases,  broadly  construed,  are 
potentially  quite  relevant  to  the  concerns  at  hand.  These  include  the  topics  of  querying  streams 
and  querying  unstructured  information  sources.  Also  included  here  are  the  topics  of  data  mining,
web  mining,  and  information  retrieval.

4.5.1  Querying  Streams

Recently,  there  has  been  interest  within  the  field  of  databases  in  developing  methods  for 
querying  streams  of  continuous  data.  This  work  is  potentially  useful  for  monitoring  sensor  data 
or  information  on  credit  card  transactions.  Much  work  has  been  carried  out  by  J.  Widom  along 
with  students  and  collaborators  at  Stanford  University  (Arasu  et  al.,  2002,  2003;  Babcock  et  al.,
2002;  Babu  and  Widom,  2001). 

Querying  streams  of  XML  data  is  considered  by  a  number  of  researchers  (Ives  et  al.,  2000b; 
Gupta  and  Suciu,  2003).  Related  issues  in  routing  XML  data  to  the  appropriate  place  are 
considered  by  Gupta  et  al.  (2003).  See  also  the  Tukwila  system  mentioned  earlier. 

The  ability  to  query  streams  is  a  critical  aspect  of  Army  knowledge  fusion.  Particular  sensors 
may  need  to  be  continuously  monitored  and  alarms  sounded  (i.e.,  messages  sent  to  the 
appropriate  people  or  agents)  when  significant  changes  occur.  For  homeland  security,  the 
querying  (monitoring)  of  streams  of  credit  card  transactions,  or  of  information  from  shipping
manifests  or  passenger  lists,  is  also  a  useful  tool.
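
A continuous monitoring query of this kind can be sketched with a sliding window. The example below is a simplified illustration, not the query model of any of the stream systems cited above; the window size, threshold, and sensor readings are hypothetical.

```python
from collections import deque


def monitor(readings, window=5, threshold=10.0):
    """Yield (index, value, baseline) whenever a reading deviates from the
    mean of the previous `window` readings by more than `threshold`."""
    recent = deque(maxlen=window)  # oldest reading is dropped automatically
    for index, value in enumerate(readings):
        if len(recent) == window:
            baseline = sum(recent) / window
            if abs(value - baseline) > threshold:
                yield index, value, baseline
        recent.append(value)


# Hypothetical sensor stream with one significant change.
SENSOR = [20.0, 21.0, 19.5, 20.5, 20.0, 20.2, 45.0, 20.1]
alarms = list(monitor(SENSOR, window=5, threshold=10.0))
for index, value, baseline in alarms:
    print(f"alarm at reading {index}: {value} (baseline {baseline:.2f})")
```

Because `monitor` is a generator, it processes readings one at a time and keeps only the current window in memory, which is the property that matters for unbounded streams.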

4.5.2  Querying  Unstructured  Information  Sources 

Developing  the  capability  to  process  data  that  are  semistructured  is  an  important  aspect  of 
current  database  research.  These  are  data  that  do  not  conform  to  a  particular  schema  as  in  a 
database,  but  are  to  varying  degrees  self-describing.  Much  of  the  data  are  in  XML  format,  but  they
could  also  be  in  HTML  or  in  one  of  the  semantic  web  languages  discussed  earlier  in  section  4.3.
It  is  to  be  expected  that  much  of  the  data  exchanged  over  the  network  supporting  NCW  will  be 
semistructured,  using  XML  or  something  similar.  Additionally,  data  available  on  the  web  are  in 
this  format.  The  capability  of  querying  such  data  is  crucial. 

An  excellent  introduction  to  work  in  this  area  is  the  book  by  Abiteboul  et  al.  (2000).  A  query 
language,  LOREL,  was  developed  as  part  of  the  TSIMMIS  Project  and  is  especially  designed  to 
query  semistructured  data.  The  queries  can  operate  over  data  having  different  types  and  also 
when  some  of  the  data  are  absent  (Abiteboul  et  al.,  1997).  This  query  language  was  migrated  to 
work  on  XML  (Goldman  et  al.,  2000).  Work  on  developing  efficient  algorithms  for  the 
evaluation  of  queries  over  semistructured  data  has  been  carried  out  by  Suciu  (2002).  A  general 
discussion  of  issues  involved  in  systems  developed  for  querying  XML  data  may  be  found  in 
Deutsch  et  al.  (1999).  Query  languages  for  querying  web  sites  are  surveyed  by  Florescu  et  al. 
(1998).  These  are  designed  for  HTML  pages.  Querying  RDF  and  RDF  Schema  is  discussed  by 
Broekstra  et  al.  (2003). 
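
The requirement that queries tolerate differing types and absent data can be illustrated with a small path query over nested records. This sketch is not LOREL or any other cited query language; the function name `select`, the record layout, and the field values are hypothetical.

```python
def select(node, path):
    """Return all values reached by following `path` (a list of keys) from
    `node`, descending through lists and skipping records that lack a key."""
    if not path:
        return [node]
    if isinstance(node, list):
        results = []
        for item in node:
            results.extend(select(item, path))
        return results
    if isinstance(node, dict) and path[0] in node:
        return select(node[path[0]], path[1:])
    return []  # Missing field or atomic value: no match, not an error.


# Hypothetical semistructured reports: the same field varies in type
# from record to record and is sometimes absent.
REPORTS = {
    "report": [
        {"unit": "A", "position": {"grid": "18S UJ 2337"}, "strength": 40},
        {"unit": "B", "position": "near the bridge"},
        {"unit": "C"},
    ]
}

grids = select(REPORTS, ["report", "position", "grid"])
units = select(REPORTS, ["report", "unit"])
print(grids)
print(units)
```

Records that do not fit the path simply contribute no results, whereas a query against a fixed relational schema would require every record to have the same shape.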

4.5.3  Web  Mining 

Recently,  there  has  been  a  growing  interest  in  web  mining:  extracting  useful  information  from
web  pages.  Chakrabarti  (2003)  has  written  a  relatively  technical  survey  of  the  work  on  the  topic. 
A  less  technical  work  oriented  towards  business  applications  is  Linoff  and  Berry  (2001).
Knoblock  et  al.  (2002b)  apply  machine  learning  techniques  to  the  extraction  of  data  from  web
pages.  Most  of  this  work  deals  with  HTML/XML  pages,  rather  than  pages  annotated  in  the
fashion  of  the  semantic  web.  If  the  pages  are  so  annotated,  then  the  mining  issues  are  rather 
different,  and  one  is  able  to  obtain  much  more  reliable  information. 

4.5.4  Data  Mining 

Data  mining  technology  clearly  has  high  relevance  to  Army  knowledge  fusion.  The  literature  on 
data  mining  is  large  and  growing  rapidly.  Good  surveys  of  the  available  techniques  in  this  area 
include  books  by  Witten  and  Frank  (1999)  and  Han  and  Kamber  (2001),  and  an  excellent  article 
by  Brown  (2002)  that  discusses  data  mining  in  the  context  of  work  in  the  area  of  data  fusion. 

4.5.5  Information  Retrieval 

The  literature  on  information  retrieval  is  also  vast  and  cannot  be  covered  here.  These  techniques 
are  of  high  importance  to  Army  knowledge  fusion  because  much  information  is  available  in
textual  sources. 

One  relatively  recent  and  excellent  survey  is  the  book  by  Baeza-Yates  and  Ribeiro-Neto  (1999).
Woods  (1997)  has  been  working  on  an  interesting  approach  called  Conceptual  Indexing.  This 
approach  tackles  directly  the  paraphrase  problem  of  different  words  being  used  in  the  texts  and 
in  the  query.  He  represents  the  semantic  relationship  among  concepts,  morphological  structure, 
and  different  words.  The  resulting  taxonomy  is  then  used  in  retrieval. 

4.6  Architectures 

The  Army  is  moving  toward  network-centric  systems,  pervasive  use  of  sensors,  and  the 
introduction  of  robots  into  multiple  force  levels.  Hardware  and  software  architectures  are 
essential  for  systematically  structuring  systems  to  support  the  fusion  of  these  concepts  and  to 
achieve  the  vision  of  a  tightly  integrated  Future  Force  made  up  of  humans  and  machines  working 
together.  Large-scale  distributed  applications  using  decentralized  infrastructures,  distributed 
operating  systems,  multi-agent  architectures,  and  dynamic  adaptability  will  be  required.  In 
addition,  architecture  technologies  for  information  integration  that  enable  fusion  at  the 
application  level  are  needed.  Applications  will  need  to  interact  with  a  full  range  of  components, 
ranging  from  databases,  application  servers,  and  data  warehouses,  to  search  engines,  agents,  data 
streams,  and  sensors.  Classical  architecture  research  is  a  maturing  field,  but  software 
architectures  designed  to  address  the  full  range  of  integration  issues  required  to  support  the 
knowledge  fusion  required  for  the  Future  Force  are  only  now  emerging. 

Shaw  (2001)  has  written  an  excellent  analysis  of  the  evolution  and  prospects  of  software 
architecture  research.  The  JDL  data  fusion  model  (White,  1988;  Steinberg  et  al.,  1998)  describes 
an  architecture  that  involves  multiple  agent-like  nodes  communicating  with  each  other.  Software 
architecture  approaches  that  address  the  engineering  of  large,  complex  software  systems  include 
new  technologies  for  component  composition  and  software  connector  technology  for  component
interaction  (Mehta  et  al.,  2000).  In  the  following  sections,  we  discuss  several  related  topics  that
pertain  to  the  distributed  architecture  of  the  GIG.

4.6.1  Agents 

The  literature  on  agents  is  quite  large.  Only  a  few  relevant  pointers  can  be  mentioned  here. 
Subrahmanian  et  al.  (2000)  give  an  excellent  survey  of  much  of  the  work  in  this  area.  The 
remainder  of  the  book  covers  the  IMPACT  (Interactive  Maryland  Platform  for  Agents 
Collaborating  Together)  architecture  for  building  an  agent-based  system.  Bradshaw  (1997)  is  a 
collection  of  articles  covering  various  applications  of  agents. 

GoLog  (Levesque  et  al.,  1997;  Reiter,  2001)  is  a  high-level  programming  language  for  agents 
based  on  the  situation  calculus.  The  idea  is  that  rather  than  perform  planning,  an  agent  can 
search  for  a  legal  execution  sequence  of  a  high-level  plan/program  by  reasoning  about  the  effects 
and  preconditions  of  actions.  Knowledge  and  knowledge-producing  actions  have  been 
incorporated  into  the  language  (Scherl,  2003).  ConGolog  (De  Giacomo  et  al.,  2000)  is  a 
concurrent  version  of  GoLog.  It  includes  facilities  for  prioritizing  the  execution  of  concurrent 
processes,  interrupts  and  exogenous  actions. 
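
The idea of searching for a legal execution of a high-level program can be sketched in a few lines. The example below is not GoLog and does not use the situation calculus; it assumes a much simpler STRIPS-style encoding in which each action has preconditions, add effects, and delete effects, and in which a program is a sequence of nondeterministic choices. All action and fluent names are hypothetical.

```python
def find_execution(program, state, actions):
    """Depth-first search for one legal action sequence.

    program: list of steps; each step is a list of alternative action names
             (a nondeterministic choice).
    state: frozenset of fluents that currently hold.
    actions: dict name -> (preconditions, add_effects, delete_effects).
    Returns a list of action names, or None if no legal execution exists.
    """
    if not program:
        return []
    for name in program[0]:
        pre, add, delete = actions[name]
        if pre <= state:  # the action is executable in this state
            next_state = (state - delete) | add
            rest = find_execution(program[1:], next_state, actions)
            if rest is not None:
                return [name] + rest
    return None  # every alternative failed; the caller backtracks


# Hypothetical domain, for illustration only.
ACTIONS = {
    "drive_road": (frozenset({"at_base", "road_open"}),
                   frozenset({"at_bridge"}), frozenset({"at_base"})),
    "drive_trail": (frozenset({"at_base"}),
                    frozenset({"at_ford"}), frozenset({"at_base"})),
    "cross_bridge": (frozenset({"at_bridge"}),
                     frozenset({"at_objective"}), frozenset({"at_bridge"})),
    "cross_ford": (frozenset({"at_ford"}),
                   frozenset({"at_objective"}), frozenset({"at_ford"})),
}
PROGRAM = [["drive_road", "drive_trail"], ["cross_bridge", "cross_ford"]]

plan = find_execution(PROGRAM, frozenset({"at_base"}), ACTIONS)
print(plan)
```

The program constrains the search to the alternatives its author wrote down, so the agent resolves only the remaining choices instead of planning from scratch.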

Lesser  and  co-authors  (Lesser  et  al.,  2000)  describe  an  agent  named  BIG,  designed  to  gather 
information  over  the  World  Wide  Web  in  support  of  a  decision  process.  One  of  its 
characteristics  is  the  capability  to  reason  about  the  resource  trade-offs  of  different  information 
gathering  approaches. 

4.6.2  Grids 

Recently  there  has  been  work  in  an  area  called  grid  computing.  The  issue  is  how  to  perform 
computation  in  a  distributed  environment  where  different  computational  resources  and  data 
sources  are  located  on  a  network  (see,  for  example,  Blythe  et  al.,  2003a,  2003b;
Deelman  et  al.,  1993).  AI  planning  techniques  are  used  to  rapidly  generate  plans  or  workflows
for  solving  particular  computational  problems,  given  a  user  specification  of  the  desired  result.  It
is  likely  that  similar  techniques  may  be  needed  to  answer  queries  over  the  network  established  to 
support  the  Future  Force. 

4.6.3  Web  Services 

Languages  developed  as  part  of  the  semantic  web  endeavor  categorize  not  only  information  but
also  services  (Peer,  2002;  Paolucci  et  al.,  2002).  DAML-S,  the  portion  of  DAML+OIL  for
describing  web  services,  is  discussed  by  Ankolekar  et  al.  (2002).  McIlraith  and  Son  (2002)  use
GoLog  for  composing  web  services.  Tate  et  al.  (2003)  specify  agents  that  plan  to  achieve  goals
by  utilizing  available  web  services.

4.6.4  Distributed  Architectures


With  the  explosion  of  the  Internet,  architectures  for  information  integration  are  emerging
to  address  the  problem  of  integrating  large-scale  application  systems  that  must  interact  with
databases,  distributed  servers,  workflow  systems,  search  engines,  message  queues,  Web
crawlers,  mining  and  analysis  packages,  and  a  large  variety  of  programming  interfaces  such  as
ODBC,  JDBC,  Java  objects,  SQL,  XML,  WSDL,  and  SOAP  (Roth  et  al.,  2002).  The  IRIS
project  at  MIT  is  developing  a  novel  decentralized  infrastructure,  based  on  distributed  hash
tables,  that  will  enable  a  new  generation  of  large-scale  distributed  applications  (Dabek  et  al.,
2003).

In  a  distributed  setting,  processes/agents  need  to  communicate  with  other  processes/agents  even 
if  they  do  not  know  the  location,  identity,  or  even  the  presence  of  the  others  with  whom  they 
need  to  communicate.  Techniques  have  been  developed  to  decouple  the  producers  of 
information  and  services  from  those  who  consume  or  use  the  information  or  services.  One  such 
architecture  for  accomplishing  this  decoupling  is  the  publish/subscribe  paradigm.  It  provides  for 
mediators  that  enable  the  communication  between  publishers/producers  and  clients/subscribers. 

If  a  publisher/producer  has  information  or  a  service,  it  publishes  it  via  the  mediator  and  the 
mediator  communicates  the  fact  that  the  information  or  service  is  available  to  those  clients  who 
have  subscribed  to  receive  notices  of  the  appropriate  kind  of  events.  See  Eugster  et  al.  (2003)  for 
a  survey  of  work  on  the  publish/subscribe  paradigm.  An  alternate  (but  related)  model  for  a 
loosely  coupled  distributed  system  based  on  the  concepts  of  event  generation,  observation,  and 
notification  is  discussed  by  Rosenblum  and  Wolf  (1997). 
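
The decoupling provided by the publish/subscribe paradigm can be sketched with an in-process mediator. This is a minimal single-machine illustration, not a distributed implementation and not the design of any system surveyed by Eugster et al.; the class name `Mediator`, the topic names, and the event contents are hypothetical.

```python
from collections import defaultdict


class Mediator:
    """Routes published events to subscribers by topic."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, callback):
        """Register interest in a topic without knowing who publishes on it."""
        self._subscribers[topic].append(callback)

    def publish(self, topic, event):
        """Deliver an event to current subscribers; return how many received it.

        The publisher never learns the identity of the subscribers.
        """
        callbacks = list(self._subscribers[topic])
        for callback in callbacks:
            callback(event)
        return len(callbacks)


mediator = Mediator()
received = []
mediator.subscribe("sensor/alarm", lambda e: received.append(("ops", e)))
mediator.subscribe("sensor/alarm", lambda e: received.append(("intel", e)))

delivered = mediator.publish("sensor/alarm", {"sensor": "S-12", "level": "high"})
undelivered = mediator.publish("logistics/fuel", {"depot": "D-3"})
print(delivered, undelivered, received)
```

Publishing to a topic with no subscribers succeeds silently, which reflects the point made above: a producer need not know whether any consumer is present.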

Minsky  and  collaborators  (Minsky  and  Ungureanu,  2000;  Murata,  2003)  have  developed  a 
mechanism  for  regulating  such  open  distributed  systems.  This  method  is  called  Law  Governed 
Interaction  (LGI)  and  provides  a  scalable  method  for  ensuring  that  such  a  system  will  continue  to 
maintain  desired  properties  even  as  new  agents  are  added  and  existing  agents  are  deleted  or  fail. 


5.  Analysis  and  Summary 


The  construction  of  information  systems  that  reliably  support  the  Future  Force  concept  depends
upon  fundamental  advances  in  the  field  of  knowledge  fusion.  The  Future  Force  concept  is  not 
achievable  solely  using  today’s  technologies.  These  advances  require  that  the  state  of  the  art  in 
the  areas  identified  in  this  report  be  advanced  and  approached  from  a  multidisciplinary 
perspective.  Specifically,  it  will  be  necessary  to  bring  these  different  areas  together,  to  combine 
the  different  approaches  into  a  single  information  system  in  which  different  pieces  of
information  are  analyzed  with  a  variety  of  techniques,  and  the  results  are  fused  together  into 
knowledge  of  the  overall  picture.  Systems  that  achieve  portions  of  this  goal  today  are  developed 
in  an  ad  hoc  manner,  making  such  systems  unreliable  and  inflexible,  often  unable  to  scale  up  or
adapt  to  dynamic  changes  in  a  real-life  military  environment.  The  potential  for  rapidly
introducing  specific  areas  of  new  fundamental  research  and  combining  this  research  with 
components  ready  or  near-ready  for  implementation  today  is  strong. 

To  date,  representation  and  reasoning  have  not  played  a  large  role  in  data  fusion  research.
Going  forward,  it  is  essential  that  these  approaches  become  a  major  part  of  knowledge  fusion
research  in  order  to  gain  fundamental  understanding  of  the  problem  and  find  reliable  solutions.
This  report  has  considered  a  variety  of  languages  for  representing  knowledge  and  information.
In  many  cases,  these  languages  have  been  provided  with  a  formally  defined  syntax  and
semantics.  Reasoning  methods  and  tools  have  been  developed  for  these  languages.

In  particular,  the  following  areas  are  well  developed: 

1.  Information  Integration  in  Databases:  How  to  automatically  give  a  global  view  to
information  available  in  heterogeneous  and  distributed  databases.  Techniques  have  been 
developed  to  handle  differing  schemas  and  terminology. 

2.  Ontologies  and  languages  and  tools  to  support  their  use:  How  to  formally  define  the 
meaning  of  terminologies  so  that  varying  terminologies  can  be  manipulated  by  machines.

3.  Semantic  Web:  The  development  of  languages  to  make  information  available  on  the  web 
(both  web  pages  and  web  services)  interpretable  by  machines. 

This  is  not  to  say  that  work  in  the  above  areas  is  completed.  There  is  still  much  to  be  done.  But 
some  problems  have  been  isolated,  and  reasonable  solutions  have  been  developed.  Work  in 
these  areas  is  mature  enough  to  begin  the  process  of  technology  transition  for  some  aspects  of  the 
knowledge  fusion  problem. 

Integration  of  the  high-level  language-oriented  reasoning  methods  and  the  data-oriented
mathematical  techniques  of  data  fusion  remains  a  fundamental  research  problem.  Some  potentially
useful  approaches  (pointed  out  earlier)  are  the  work  of  D.  Brown  and  collaborators  on
incorporating  world  knowledge  into  techniques  for  tracking  objects  and  also  the  work  of
D.  Koller,  A.  Pfeffer,  and  others  on  creating  probabilistic  versions  of  a  number  of
knowledge-representation  languages.  More  work  is  needed  on  integrating  the  information
obtainable  from  these  different  levels. 

A  great  deal  of  fundamental  research  in  architectures  for  data  fusion  remains  to  be  done.
Specific  approaches  for  developing  large-scale,  distributed  architectures  that  will  support  the
Future  Force  vision  at  all  force  levels  need  to  be  researched.  Work  in  the  following  areas  can
be  drawn  upon: 

1.  Agents:  How  to  build  systems  of  multiple  interacting  programs. 

2.  Grids:  How  to  perform  distributed  computations.

3.  Web  Services:  Incorporation  of  services  into  the  semantic  web.

4.  Distributed  Systems:  How  to  create  robust  distributed  architectures. 

5.  Streaming  Data:  How  to  summarize,  simplify,  and  identify  critical  changes  and  critical  objects
in  real-time  streams  of  sensor,  video,  audio,  and  messaging  data.

6.  Information  Integration:  How  to  design  and  manage  the  integration  of  heterogeneous 
application  components  and  their  interactions. 

There  are  a  number  of  areas  of  reasoning  and  representation  in  which  more  work  is  needed  for 
successful  knowledge  fusion: 

1.  The  determination  of  the  goals  of  the  various  human  entities  (e.g.,  enemy  forces)  is  a  major
problem,  and  the  state  of  current  knowledge  is  not  sufficient  to  easily  address  the  problem.
On  the  basis  of  available  evidence,  it  is  necessary  to  identify  the  likely  plan  of  different
agents  in  the  battlespace.  More  work  is  needed  in  plan  recognition,  situation  and  impact
assessment  and  awareness,  and  abductive  reasoning.  These  issues  are  clearly  related  to  the
handling  of  uncertainty  and  also  the  representation  of  information. 

2.  It  is  not  clear  how  to  combine  the  lower  level  sensor  readings  and  the  statistical  data  fusion 
predictions  with  the  higher  level  representation  of  plans  and  with  real-time  information  from  the
web  and  other  sources.  More  work  is  needed  to  address  this  issue  both  by  developing
appropriate  fusion  architectures  and  new  forms  of  representation,  including  methods  for 
representing  audio  and  video  content. 

3.  Work  is  needed  on  how  to  properly  integrate  textual  information  into  the  overall  CROP. 
This  information  may,  for  example,  be  reports  of  human  observers.

4.  More  work  is  needed  on  how  to  handle  uncertainty  in  this  setting.  Information  may  be 
conflicting.  It  is  necessary  to  manage  multiple  competing  hypotheses  and  update  the 
probabilities  of  each  based  upon  newly  arriving  information. 

5.  A  flexible  method  of  representing  actions,  decisions,  and  plans  is  needed.  This  includes 
both  the  hypothesized  actions  and  plans  of  the  enemy  forces  and  representations  to  be 
given  to  agents  or  battlespace  entities.  One  of  the  ideas  behind  NCW  is  that  commanders 
do  not  have  to  issue  top-down  detailed  instructions,  but  can  rely  on  the  local  units  to  act  in 
accordance  with  their  knowledge  to  achieve  the  desired  behavior.  But  the  specification 
(and  monitoring)  of  the  action  or  plan  to  be  given  to  such  a  system  of  knowledgeable  actors 
is  an  open  question. 